Building a PromptOps Pipeline: From Git to Deployment

TL;DR

In the previous post I argued for treating prompts as governed data assets. This is the working implementation — every layer of the PromptOps stack, with code:

Code: github.com/raguvindtharanitharan/promptops.


A working slice, not a finished product

This isn’t a complete production platform. Drift detection, real-model evaluations, and dashboards still need to be built. But it’s a working slice that runs locally, validates on every PR, and deploys to AWS in one command. That’s the credible foundation — and the part most teams skip when they say they’re “doing PromptOps.”

The repo follows the same architecture as the serverless data platform piece. service_core/ becomes promptops_core/. The same Lambda Layer + Glue ZIP pattern carries the shared code. The same Serverless Compose orchestration ties it together. If you read that piece, this one should feel like an extension of the same architectural discipline applied to a new asset class.

The structure is the six layers from the previous PromptOps article, now in code:

1. Storage         — prompts/ directory of YAML files
2. Registry        — promptops_core/ Python module, loads + serves
3. CI/CD           — validate_prompts.py + eval_prompts.py + GitHub Actions
4. Runtime         — Lambda + Glue consume the registry
5. Observability   — log_prompt_event() emits structured JSON per call
6. Deploy          — Serverless Compose deploys it all
PromptOps reference architecture: Source & CI layer at the top, artifacts produced from promptops_core in the middle, Lambda and Glue consumers below, CloudWatch observability layer at the bottom.
Reference architecture — same shape as the serverless data platform piece, applied to prompts. promptops_core packaged once, consumed by Lambda and Glue, with structured observability flowing back.

I’ll walk through each in order.


1. Storage — prompts as YAML

Prompts live in version control as YAML files, organized by domain:

prompts/
└── customer_support/
    └── intent_classifier.yaml

Each file is a self-contained record:

id: customer-support.intent-classifier
version: 2
owner: data-platform-team
model: claude-3-5-sonnet
parent: 1
tags:
  - classification
  - customer-support
metrics:
  baseline_accuracy: 0.94
  cost_per_call_usd: 0.003
template: |
  You are a customer support intent classifier.
  Given the user message and the previous conversation context,
  classify the intent into exactly one of:
  - billing      (questions about charges, invoices, refunds)
  - technical    (product not working, errors, bugs)
  - sales        (pricing, plans, upgrades, new accounts)
  - retention    (cancellation, downgrade, dissatisfaction)
  - other        (anything that doesn't fit above)
  Respond with only the single category label, lowercase, no
  punctuation or explanation.
created: 2026-04-15
updated: 2026-05-01

Why YAML, not JSON or code? Three reasons:

The required fields aren’t decorative. Each one closes a specific failure mode I’ve watched teams hit:

FieldCloses the failure mode where…
idTwo teams independently named their prompt the same thing
versionA prompt change broke prod, and rolling back the deploy didn’t help
ownerNobody knew who could approve a change
modelThe same prompt was secretly running on three different models
parentLineage was lost between iterations
tagsSearch by domain became impossible past 50 prompts
metrics”Accuracy” was discussed but never measured

Validation enforces these at PR time. No prompt reaches main without them.


2. Registry — same import, two runtimes

The registry is one Python module (promptops_core/) that loads prompts and returns immutable Prompt objects:

# promptops_core/registry.py
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
import yaml

PROMPTS_DIR = Path(__file__).resolve().parent.parent / "prompts"


@dataclass(frozen=True)
class Prompt:
    id: str
    version: int
    owner: str
    model: str
    template: str
    tags: tuple[str, ...] = ()
    parent: int | None = None
    metrics: dict[str, Any] = field(default_factory=dict)


def load_prompt(prompt_id: str) -> Prompt:
    path = _id_to_path(prompt_id)
    if not path.exists():
        raise FileNotFoundError(f"Prompt '{prompt_id}' not registered")

    data = yaml.safe_load(path.read_text())
    return Prompt(
        id=data["id"],
        version=data["version"],
        owner=data["owner"],
        model=data["model"],
        template=data["template"],
        tags=tuple(data.get("tags", [])),
        parent=data.get("parent"),
        metrics=dict(data.get("metrics", {})),
    )

Three design choices worth flagging:

This is the load side. The publish side — packaging — is the part that earns the “two runtimes” claim.

The build_artifacts.py script produces two ZIPs from the same source:

Both contain the same Python module, the same prompt YAML, the same pyyaml runtime dep — packaged differently per runtime. Same source, two packagings. The pattern is identical to the serverless data platform piece — it just happens to apply equally well to prompts.


3. CI/CD — schema gate plus behavioral gate

Two scripts, two gates.

The schema gate (scripts/validate_prompts.py) checks every prompt YAML for required fields, types, and an id-matches-path consistency:

REQUIRED_FIELDS: dict[str, type] = {
    "id": str, "version": int, "owner": str,
    "model": str, "template": str, "tags": list,
}


def validate_file(path, prompts_root):
    data = yaml.safe_load(path.read_text())
    errors = []
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in data:
            errors.append(f"{path}: missing field '{field_name}'")
        elif not isinstance(data[field_name], expected_type):
            errors.append(f"{path}: '{field_name}' must be {expected_type.__name__}")

    expected_id = _expected_id_from_path(path, prompts_root)
    if data.get("id") != expected_id:
        errors.append(f"{path}: id doesn't match path")
    return errors

The path-vs-id check is a small detail with outsized value. It catches the bug class where someone renames a file but forgets to update the id field, or vice versa. CI blocks the PR — the prompt’s identity stays consistent between filesystem and metadata.

The behavioral gate (scripts/eval_prompts.py) runs each prompt against test cases, computes accuracy, and fails the build if accuracy drops below baseline_accuracy declared in the prompt’s metadata:

def evaluate_prompt(prompt_id, eval_path):
    cases = yaml.safe_load(eval_path.read_text())["cases"]
    prompt = load_prompt(prompt_id)
    passes = sum(
        1 for case in cases
        if classify_stub(prompt.template, case["input"]) == case["expected"]
    )
    return passes / len(cases)


def main():
    for eval_file in sorted(EVAL_DIR.rglob("*.yaml")):
        data = yaml.safe_load(eval_file.read_text())
        prompt = load_prompt(data["prompt_id"])
        accuracy = evaluate_prompt(prompt.id, eval_file)
        baseline = prompt.metrics.get("baseline_accuracy")
        if baseline is not None and accuracy < baseline:
            return 1   # CI fails
    return 0

In the demo I stub the model with a keyword classifier so the eval runs without API keys. In real usage, classify_stub() becomes a Bedrock or Anthropic call, and the test cases come from a held-out evaluation set per prompt. The framework is identical either way — the only change is what’s behind the model boundary.

Both gates run in GitHub Actions on every PR:

# .github/workflows/validate.yml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.10" }
      - run: pip install -e ".[dev]"
      - run: python scripts/validate_prompts.py
      - run: python scripts/eval_prompts.py

A failed schema check or a regressed eval blocks the merge. No prompt with missing metadata reaches main. No prompt that degrades accuracy ships to prod. The pipeline is the discipline.


4. Runtime — Lambda and Glue using the same import

Both runtimes consume the registry through the same import statement. That’s the architectural punchline:

from promptops_core.registry import load_prompt
from promptops_core.observability import log_prompt_event, EVENT_INVOKED, EVENT_COMPLETED

What changes is the packaging mechanic, invisible to the import.

Lambda consumes via Lambda Layer. The Layer ZIP unpacks at /opt/python/, so the runtime sees /opt/python/promptops_core/ and /opt/python/prompts/. Standard Python search path picks them up:

# examples/lambda/hello_world_lambda.py
import time
import json

from promptops_core.registry import load_prompt
from promptops_core.observability import (
    log_prompt_event, EVENT_INVOKED, EVENT_COMPLETED
)

PROMPT_ID = "customer-support.intent-classifier"


def handler(event, context):
    user_message = event.get("message", "")
    prompt = load_prompt(PROMPT_ID)
    request_id = getattr(context, "aws_request_id", "local")

    log_prompt_event(EVENT_INVOKED, prompt,
                     request_id=request_id,
                     user_message_length=len(user_message))

    start = time.perf_counter()
    intent = "billing"  # Stub — replace with model call
    latency_ms = int((time.perf_counter() - start) * 1000)

    log_prompt_event(EVENT_COMPLETED, prompt,
                     request_id=request_id,
                     latency_ms=latency_ms,
                     intent=intent)

    return {
        "statusCode": 200,
        "body": json.dumps({
            "prompt_id": prompt.id,
            "prompt_version": prompt.version,
            "intent": intent,
        }),
    }

Glue consumes via --extra-py-files. The Glue ZIP gets uploaded to S3 and referenced in the job’s DefaultArguments:

# examples/serverless.yml (excerpt)
HelloWorldGlueJob:
  Type: AWS::Glue::Job
  Properties:
    Command:
      ScriptLocation: s3://.../glue/hello_world_data_extractor.py
    DefaultArguments:
      "--extra-py-files": s3://.../glue-libraries/promptops_glue.zip
      "--service-name": hello-world

The Glue script imports the registry the same way:

# examples/glue/hello_world_data_extractor.py
import sys
from awsglue.utils import getResolvedOptions

from promptops_core.registry import load_prompt
from promptops_core.observability import (
    log_prompt_event, EVENT_INVOKED, EVENT_COMPLETED
)


def main():
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    prompt = load_prompt("customer-support.intent-classifier")

    log_prompt_event(EVENT_INVOKED, prompt,
                     job_name=args["JOB_NAME"], runtime="glue")

    # ... process records, call model per row ...

    log_prompt_event(EVENT_COMPLETED, prompt,
                     job_name=args["JOB_NAME"],
                     runtime="glue",
                     records_processed=0)

Identical imports, identical patterns. The job code doesn’t know which runtime it’s in. That’s the point.


5. Observability — one event helper, every invocation

Every prompt call emits at least one structured event. The schema is small and consistent:

# promptops_core/observability.py
import json
import sys

EVENT_INVOKED = "prompt_invoked"
EVENT_COMPLETED = "prompt_completed"
EVENT_FAILED = "prompt_failed"


def log_prompt_event(event_type, prompt, **extra):
    payload = {
        "event": event_type,
        "prompt_id": prompt.id,
        "prompt_version": prompt.version,
        "prompt_owner": prompt.owner,
        "prompt_model": prompt.model,
    }
    payload.update(extra)
    print(json.dumps(payload), file=sys.stdout)

Five required fields, plus arbitrary extras (latency_ms, tokens, error, request_id, runtime). Every invocation, every runtime, same shape.

Why this matters concretely:

The discipline is: never call a prompt without logging the prompt id and version with it. The helper makes that the path of least resistance — drop in two lines, every call is traceable forever.


6. Deploy — one command, two services

Deployment uses Serverless Compose, the same orchestration pattern as the serverless data platform piece:

# serverless-compose.yml
services:
  promptops-core:
    path: ./promptops_core
  promptops-examples:
    path: ./examples
    dependsOn: promptops-core

promptops-core runs first. Its serverless.yml:

  1. Calls python ../scripts/build_artifacts.py (build hook, before package)
  2. Publishes the Lambda Layer from dist/promptops_layer.zip
  3. Uploads dist/promptops_glue.zip to S3 (after deploy)
  4. Exports the Layer ARN via CloudFormation outputs

promptops-examples runs second. Its serverless.yml:

  1. Uploads the Glue script to S3 (before deploy)
  2. Defines the hello-world Lambda, attached to the Layer ARN imported from promptops-core
  3. Defines the Glue job, referencing the Glue ZIP via --extra-py-files

A single command deploys everything, in dependency order:

npm run deploy:dev

What this gives you:


What I’ve stopped doing in PromptOps repos

Five patterns I’ve stopped tolerating, one for each of the gaps the discipline closes:


What’s left

This is a working slice. Phase 4 — coming in the next post — adds:

What I’m exploring next: agent integration. When a prompt is composed and called by an autonomous agent rather than a deterministic Lambda, the observability discipline matters more, not less. The volume goes up two orders of magnitude. The same event schema scales. The same versioning discipline scales. The platform is the foundation; agents are what sit on top.

Build the platform first. Then make it smart.


Code: github.com/raguvindtharanitharan/promptops

Previous in this series: