Building a PromptOps Pipeline: From Git to Deployment

11 Mar 2026 · 13 min read

promptops
ai-native
devops
ci-cd
aws

TL;DR

In the previous post I argued for treating prompts as governed data assets. This is the working implementation — every layer of the PromptOps stack, with code:

Storage — YAML schema with required metadata
Registry — a Prompt dataclass + load_prompt(id) loader, packaged once for Lambda Layer + Glue --extra-py-files
CI/CD — schema validator + behavioral eval, wired to GitHub Actions, blocking merges on failure
Runtime — Lambda + Glue examples using the same import path
Observability — log_prompt_event() helper emitting structured JSON for every invocation
Deploy — Serverless Compose orchestrating Layer publish + Lambda + Glue in one command

Code: github.com/raguvindtharanitharan/promptops.

A working slice, not a finished product

This isn’t a complete production platform. Drift detection, real-model evaluations, and dashboards still need to be built. But it’s a working slice that runs locally, validates on every PR, and deploys to AWS in one command. That’s the credible foundation — and the part most teams skip when they say they’re “doing PromptOps.”

The repo follows the same architecture as the serverless data platform piece. service_core/ becomes promptops_core/. The same Lambda Layer + Glue ZIP pattern carries the shared code. The same Serverless Compose orchestration ties it together. If you read that piece, this one should feel like an extension of the same architectural discipline applied to a new asset class.

The structure is the six layers from the previous PromptOps article, now in code:

1. Storage         — prompts/ directory of YAML files
2. Registry        — promptops_core/ Python module, loads + serves
3. CI/CD           — validate_prompts.py + eval_prompts.py + GitHub Actions
4. Runtime         — Lambda + Glue consume the registry
5. Observability   — log_prompt_event() emits structured JSON per call
6. Deploy          — Serverless Compose deploys it all

Figure: PromptOps reference architecture. The Source & CI layer sits at the top, artifacts produced from promptops_core in the middle, Lambda and Glue consumers below, and the CloudWatch observability layer at the bottom. It has the same shape as the serverless data platform piece, applied to prompts: promptops_core is packaged once, consumed by Lambda and Glue, with structured observability flowing back.

I’ll walk through each in order.

1. Storage — prompts as YAML

Prompts live in version control as YAML files, organized by domain:

prompts/
└── customer_support/
    └── intent_classifier.yaml

Each file is a self-contained record:

id: customer-support.intent-classifier
version: 2
owner: data-platform-team
model: claude-3-5-sonnet
parent: 1
tags:
  - classification
  - customer-support
metrics:
  baseline_accuracy: 0.94
  cost_per_call_usd: 0.003
template: |
  You are a customer support intent classifier.
  Given the user message and the previous conversation context,
  classify the intent into exactly one of:
  - billing      (questions about charges, invoices, refunds)
  - technical    (product not working, errors, bugs)
  - sales        (pricing, plans, upgrades, new accounts)
  - retention    (cancellation, downgrade, dissatisfaction)
  - other        (anything that doesn't fit above)
  Respond with only the single category label, lowercase, no
  punctuation or explanation.
created: 2026-04-15
updated: 2026-05-01

Why YAML, not JSON or code? Three reasons:

Multi-line strings. YAML’s | block lets the prompt template breathe. JSON requires \n escapes; Python source requires triple-quoted strings inside a class. Neither survives diff review well.
Schema without code. Anyone can edit a YAML file and submit a PR. Product managers, content writers, even legal can review prompt changes. PromptOps governance shouldn’t gatekeep on engineering language.
Naturally diff-friendly. Reviewers see exactly which line of the template changed between version 1 and version 2.
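
To make the diff-review point concrete, here is a small standard-library sketch. The two template strings are shortened, illustrative stand-ins (not the real prompt), and changed_lines is a hypothetical helper. It compares what a reviewer sees when one category is added to a template stored line by line (as a YAML block scalar keeps it) versus the same template stored as a single JSON string.

```python
import difflib
import json

# Illustrative templates only: version 2 adds one category line.
v1 = "You are a classifier.\nClassify into one of:\n- billing\n- technical\n"
v2 = "You are a classifier.\nClassify into one of:\n- billing\n- technical\n- sales\n"


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the added/removed lines of a unified diff (headers dropped)."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Line-by-line storage: the diff is exactly the one new line.
block_diff = changed_lines(v1, v2)

# JSON storage: the whole template is one escaped string on one line,
# so the diff removes and re-adds the entire template.
json_diff = changed_lines(json.dumps({"template": v1}), json.dumps({"template": v2}))

print(block_diff)
print(len(json_diff))
```

With line-by-line storage the reviewer sees a single added line; with the JSON form they have to spot the change inside two near-identical long strings.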

The required fields aren’t decorative. Each one closes a specific failure mode I’ve watched teams hit:

Field	Closes the failure mode where…
`id`	Two teams independently named their prompt the same thing
`version`	A prompt change broke prod, and rolling back the deploy didn’t help
`owner`	Nobody knew who could approve a change
`model`	The same prompt was secretly running on three different models
`parent`	Lineage was lost between iterations
`tags`	Search by domain became impossible past 50 prompts
`metrics`	“Accuracy” was discussed but never measured

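Validation enforces the core of these at PR time: the schema gate in section 3 requires id, version, owner, model, template, and tags, while parent and metrics are optional in the Prompt dataclass. No prompt reaches main without the required set.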

2. Registry — same import, two runtimes

The registry is one Python module (promptops_core/) that loads prompts and returns immutable Prompt objects:

# promptops_core/registry.py
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
import yaml

PROMPTS_DIR = Path(__file__).resolve().parent.parent / "prompts"


@dataclass(frozen=True)
class Prompt:
    id: str
    version: int
    owner: str
    model: str
    template: str
    tags: tuple[str, ...] = ()
    parent: int | None = None
    metrics: dict[str, Any] = field(default_factory=dict)


def load_prompt(prompt_id: str) -> Prompt:
    path = _id_to_path(prompt_id)
    if not path.exists():
        raise FileNotFoundError(f"Prompt '{prompt_id}' not registered")

    data = yaml.safe_load(path.read_text())
    return Prompt(
        id=data["id"],
        version=data["version"],
        owner=data["owner"],
        model=data["model"],
        template=data["template"],
        tags=tuple(data.get("tags", [])),
        parent=data.get("parent"),
        metrics=dict(data.get("metrics", {})),
    )

Three design choices worth flagging:

@dataclass(frozen=True). A loaded prompt is immutable. Callers can’t mutate the cached version and accidentally pollute downstream consumers. The “prompts as data assets” thesis shows up at the type level.
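Tuples instead of lists for tags. frozen=True only blocks attribute reassignment; a list field could still be mutated in place. Coercing the YAML list to a tuple on the way in closes that gap for tags. Note that metrics stays a plain dict, so it is still mutable and makes Prompt instances unhashable.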
Loud failure on miss. Missing prompt id raises FileNotFoundError. No silent fallbacks, no “default prompt” — a missing id is a programmer error, surfaced at the point of failure.
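
A minimal sketch of what frozen=True does and does not guarantee, using a trimmed copy of the Prompt dataclass (only four of its fields, so this runs without the rest of the module). The variable names below are illustrative.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any


@dataclass(frozen=True)
class Prompt:
    id: str
    version: int
    tags: tuple[str, ...] = ()
    metrics: dict[str, Any] = field(default_factory=dict)


prompt = Prompt(
    id="customer-support.intent-classifier",
    version=2,
    tags=tuple(["classification", "customer-support"]),  # YAML list -> tuple
    metrics={"baseline_accuracy": 0.94},
)

# Reassigning a field on a frozen dataclass raises FrozenInstanceError.
try:
    prompt.version = 3
    reassigned = True
except FrozenInstanceError:
    reassigned = False

# A tuple has no append(), so tags cannot be grown in place.
tags_are_appendable = hasattr(prompt.tags, "append")

# Caveat: frozen=True does not deep-freeze. The metrics dict is still
# mutable, and because a dict is unhashable, hash(prompt) raises TypeError.
prompt.metrics["baseline_accuracy"] = 0.5
try:
    hash(prompt)
    hashable = True
except TypeError:
    hashable = False
```

If metrics also needs to be read-only, one option (not something the repo is known to do) is wrapping it in types.MappingProxyType at load time.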

This is the load side. The publish side — packaging — is the part that earns the “two runtimes” claim.

The build_artifacts.py script produces two ZIPs from the same source:

dist/promptops_layer.zip — Lambda Layer convention (python/promptops_core/, python/prompts/, python/yaml/)
dist/promptops_glue.zip — Glue --extra-py-files convention (flat: promptops_core/, prompts/, yaml/)

Both contain the same Python module, the same prompt YAML, and the same pyyaml runtime dependency, packaged differently per runtime. The pattern is identical to the serverless data platform piece; it just happens to apply equally well to prompts.
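
The actual build_artifacts.py isn't shown here, so the following is only a sketch of the idea under stated assumptions: build_zip is a hypothetical helper, the demo tree is a throwaway stand-in for the repo, and vendoring pyyaml (the yaml/ directory) is left out. The only difference between the two artifacts is the path prefix inside the archive.

```python
import tempfile
import zipfile
from pathlib import Path


def build_zip(sources: list[Path], out_path: Path, prefix: str = "") -> list[str]:
    """Zip each source directory into out_path, optionally under a prefix.

    prefix="python/" gives the Lambda Layer layout; prefix="" gives the
    flat layout used with Glue --extra-py-files.
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for src in sources:
            for file in sorted(src.rglob("*")):
                if file.is_file():
                    # Keep the top-level directory name inside the archive.
                    arcname = prefix + file.relative_to(src.parent).as_posix()
                    zf.write(file, arcname)
        return sorted(zf.namelist())


# Demo against a temporary tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "promptops_core").mkdir()
    (root / "promptops_core" / "registry.py").write_text("# registry\n")
    (root / "prompts" / "customer_support").mkdir(parents=True)
    (root / "prompts" / "customer_support" / "intent_classifier.yaml").write_text("id: x\n")

    sources = [root / "promptops_core", root / "prompts"]
    layer_names = build_zip(sources, root / "dist" / "promptops_layer.zip", "python/")
    glue_names = build_zip(sources, root / "dist" / "promptops_glue.zip")

print(layer_names)
print(glue_names)
```

Because registry.py resolves the prompts directory relative to its own location, keeping promptops_core/ and prompts/ as siblings in both layouts is what lets the same loader work in either runtime.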

3. CI/CD — schema gate plus behavioral gate

Two scripts, two gates.

The schema gate (scripts/validate_prompts.py) checks every prompt YAML for required fields, field types, and id-matches-path consistency:

REQUIRED_FIELDS: dict[str, type] = {
    "id": str, "version": int, "owner": str,
    "model": str, "template": str, "tags": list,
}


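def validate_file(path, prompts_root):
    """Return a list of error strings for one prompt file (empty if valid)."""
    data = yaml.safe_load(path.read_text())
    # An empty or non-mapping file would otherwise crash the field checks below.
    if not isinstance(data, dict):
        return [f"{path}: expected a YAML mapping at the top level"]

    errors = []
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in data:
            errors.append(f"{path}: missing field '{field_name}'")
        elif not isinstance(data[field_name], expected_type):
            errors.append(f"{path}: '{field_name}' must be {expected_type.__name__}")

    # The id declared in the file must match the id implied by its location.
    expected_id = _expected_id_from_path(path, prompts_root)
    if data.get("id") != expected_id:
        errors.append(f"{path}: id doesn't match path")
    return errors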

The path-vs-id check is a small detail with outsized value. It catches the bug class where someone renames a file but forgets to update the id field, or vice versa. CI blocks the PR — the prompt’s identity stays consistent between filesystem and metadata.
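
The helpers _id_to_path and _expected_id_from_path are referenced in the listings but not shown. The sketch below is an assumption inferred from the single example in this post (customer-support.intent-classifier stored at customer_support/intent_classifier.yaml): dots separate directories, and hyphens in the id map to underscores on disk. The function names here are hypothetical public variants, and the repo's real rules may differ.

```python
from pathlib import Path


def id_to_path(prompt_id: str, prompts_root: Path) -> Path:
    """Map 'customer-support.intent-classifier' to
    <root>/customer_support/intent_classifier.yaml (assumed convention)."""
    parts = [part.replace("-", "_") for part in prompt_id.split(".")]
    return prompts_root.joinpath(*parts).with_suffix(".yaml")


def expected_id_from_path(path: Path, prompts_root: Path) -> str:
    """Inverse mapping: derive the id a file should declare from its location."""
    relative = path.relative_to(prompts_root).with_suffix("")
    return ".".join(part.replace("_", "-") for part in relative.parts)


root = Path("prompts")
path = id_to_path("customer-support.intent-classifier", root)
round_trip = expected_id_from_path(path, root)

print(path.as_posix())
print(round_trip)
```

One consequence of this particular convention is that an id containing a literal underscore would not round-trip, which is the kind of mismatch the path-vs-id check would surface in CI.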

The behavioral gate (scripts/eval_prompts.py) runs each prompt against test cases, computes accuracy, and fails the build if accuracy drops below baseline_accuracy declared in the prompt’s metadata:

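def evaluate_prompt(prompt_id, eval_path):
    """Return the fraction of eval cases the prompt classifies correctly."""
    cases = yaml.safe_load(eval_path.read_text())["cases"]
    if not cases:
        # Avoid a ZeroDivisionError and make the empty eval file visible.
        raise ValueError(f"{eval_path}: no eval cases defined")
    prompt = load_prompt(prompt_id)
    passes = sum(
        1 for case in cases
        if classify_stub(prompt.template, case["input"]) == case["expected"]
    )
    return passes / len(cases)


def main():
    failed = False
    for eval_file in sorted(EVAL_DIR.rglob("*.yaml")):
        data = yaml.safe_load(eval_file.read_text())
        prompt = load_prompt(data["prompt_id"])
        accuracy = evaluate_prompt(prompt.id, eval_file)
        baseline = prompt.metrics.get("baseline_accuracy")
        if baseline is not None and accuracy < baseline:
            # Report every regression instead of stopping silently at the first.
            print(f"{prompt.id}: accuracy {accuracy:.2f} below baseline {baseline:.2f}")
            failed = True
    return 1 if failed else 0   # non-zero exit fails CI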

In the demo I stub the model with a keyword classifier so the eval runs without API keys. In real usage, classify_stub() becomes a Bedrock or Anthropic call, and the test cases come from a held-out evaluation set per prompt. The framework is identical either way — the only change is what’s behind the model boundary.
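
The stub itself isn't listed, so here is a minimal sketch of what a deterministic keyword classifier behind the model boundary could look like. The keyword table, the Classifier alias, the accuracy helper, and the test cases are all hypothetical; the demo's real classify_stub may differ. The signature matches how classify_stub is called in the eval listing: template first, then the input text.

```python
from typing import Callable

# Hypothetical keyword table, checked in insertion order.
KEYWORDS: dict[str, tuple[str, ...]] = {
    "billing": ("invoice", "charge", "refund"),
    "technical": ("error", "bug", "crash"),
    "sales": ("pricing", "upgrade", "plan"),
    "retention": ("cancel", "downgrade"),
}

# The model boundary: (template, user_input) -> label.
Classifier = Callable[[str, str], str]


def classify_stub(template: str, user_input: str) -> str:
    """Deterministic stand-in for a model call: first keyword match wins."""
    text = user_input.lower()
    for label, words in KEYWORDS.items():
        if any(word in text for word in words):
            return label
    return "other"


def accuracy(classify: Classifier, template: str, cases: list[dict[str, str]]) -> float:
    """Fraction of cases where the classifier returns the expected label."""
    if not cases:
        raise ValueError("no eval cases")
    hits = sum(classify(template, c["input"]) == c["expected"] for c in cases)
    return hits / len(cases)


cases = [
    {"input": "I was charged twice, please refund me", "expected": "billing"},
    {"input": "The app shows an error on login", "expected": "technical"},
    {"input": "How do I cancel my subscription?", "expected": "retention"},
    {"input": "What's the weather like?", "expected": "other"},
]
score = accuracy(classify_stub, "unused by the stub", cases)
gate_passes = score >= 0.94  # baseline_accuracy from the prompt metadata

print(score, gate_passes)
```

Swapping in a real model then means passing a different Classifier with the same signature; the accuracy computation and the baseline comparison stay untouched.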

Both gates run in GitHub Actions on every PR:

# .github/workflows/validate.yml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.10" }
      - run: pip install -e ".[dev]"
      - run: python scripts/validate_prompts.py
      - run: python scripts/eval_prompts.py

A failed schema check or a regressed eval blocks the merge. No prompt with missing metadata reaches main. No prompt that degrades accuracy ships to prod. The pipeline is the discipline.

4. Runtime — Lambda and Glue using the same import

Both runtimes consume the registry through the same import statement. That’s the architectural punchline:

from promptops_core.registry import load_prompt
from promptops_core.observability import log_prompt_event, EVENT_INVOKED, EVENT_COMPLETED

What changes is the packaging mechanic, invisible to the import.

Lambda consumes via Lambda Layer. The Layer ZIP unpacks at /opt/python/, so the runtime sees /opt/python/promptops_core/ and /opt/python/prompts/. Standard Python search path picks them up:

# examples/lambda/hello_world_lambda.py
import time
import json

from promptops_core.registry import load_prompt
from promptops_core.observability import (
    log_prompt_event, EVENT_INVOKED, EVENT_COMPLETED
)

PROMPT_ID = "customer-support.intent-classifier"


def handler(event, context):
    user_message = event.get("message", "")
    prompt = load_prompt(PROMPT_ID)
    request_id = getattr(context, "aws_request_id", "local")

    log_prompt_event(EVENT_INVOKED, prompt,
                     request_id=request_id,
                     user_message_length=len(user_message))

    start = time.perf_counter()
    intent = "billing"  # Stub — replace with model call
    latency_ms = int((time.perf_counter() - start) * 1000)

    log_prompt_event(EVENT_COMPLETED, prompt,
                     request_id=request_id,
                     latency_ms=latency_ms,
                     intent=intent)

    return {
        "statusCode": 200,
        "body": json.dumps({
            "prompt_id": prompt.id,
            "prompt_version": prompt.version,
            "intent": intent,
        }),
    }

Glue consumes via --extra-py-files. The Glue ZIP gets uploaded to S3 and referenced in the job’s DefaultArguments:

# examples/serverless.yml (excerpt)
HelloWorldGlueJob:
  Type: AWS::Glue::Job
  Properties:
    Command:
      ScriptLocation: s3://.../glue/hello_world_data_extractor.py
    DefaultArguments:
      "--extra-py-files": s3://.../glue-libraries/promptops_glue.zip
      "--service-name": hello-world

The Glue script imports the registry the same way:

# examples/glue/hello_world_data_extractor.py
import sys
from awsglue.utils import getResolvedOptions

from promptops_core.registry import load_prompt
from promptops_core.observability import (
    log_prompt_event, EVENT_INVOKED, EVENT_COMPLETED
)


def main():
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    prompt = load_prompt("customer-support.intent-classifier")

    log_prompt_event(EVENT_INVOKED, prompt,
                     job_name=args["JOB_NAME"], runtime="glue")

    # ... process records, call model per row ...

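    log_prompt_event(EVENT_COMPLETED, prompt,
                     job_name=args["JOB_NAME"],
                     runtime="glue",
                     records_processed=0)


# Glue runs the script as the main module, so call main() explicitly.
if __name__ == "__main__":
    main()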

Identical imports, identical patterns. The job code doesn’t know which runtime it’s in. That’s the point.

5. Observability — one event helper, every invocation

Every prompt call emits at least one structured event. The schema is small and consistent:

# promptops_core/observability.py
import json
import sys

EVENT_INVOKED = "prompt_invoked"
EVENT_COMPLETED = "prompt_completed"
EVENT_FAILED = "prompt_failed"


def log_prompt_event(event_type, prompt, **extra):
    payload = {
        "event": event_type,
        "prompt_id": prompt.id,
        "prompt_version": prompt.version,
        "prompt_owner": prompt.owner,
        "prompt_model": prompt.model,
    }
    payload.update(extra)
    print(json.dumps(payload), file=sys.stdout)

Five required fields, plus arbitrary extras (latency_ms, tokens, error, request_id, runtime). Every invocation, every runtime, same shape.

Why this matters concretely:

CloudWatch Logs Insights queries become trivial: filter prompt_id = "customer-support.intent-classifier" and prompt_version = 2. No regex parsing of free-form strings.
Drift detection (Phase 4) reads these events directly — no separate metric pipeline needed.
Per-prompt cost reporting: aggregate tokens field by prompt_id, multiply by model cost, finance dashboard done.
Triage: when a model output looks wrong, the request_id lets you find the exact prompt + version + invocation context in one query.
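
As a rough illustration of the cost-reporting point, the sketch below aggregates the tokens field from structured log lines by prompt id and version. Everything specific in it is an assumption: the sample log lines are made up, the per-1K-token price is a placeholder rather than a real rate, and cost_by_prompt is a hypothetical helper. In practice the lines would come from a CloudWatch export or query result rather than an in-memory list.

```python
import json
from collections import defaultdict

# Placeholder price, not a real rate; actual pricing depends on model and provider.
USD_PER_1K_TOKENS = {"claude-3-5-sonnet": 0.002}

# Made-up sample lines in the shape emitted by log_prompt_event.
log_lines = [
    '{"event": "prompt_completed", "prompt_id": "customer-support.intent-classifier", '
    '"prompt_version": 2, "prompt_model": "claude-3-5-sonnet", "tokens": 400}',
    '{"event": "prompt_completed", "prompt_id": "customer-support.intent-classifier", '
    '"prompt_version": 2, "prompt_model": "claude-3-5-sonnet", "tokens": 600}',
    '{"event": "prompt_invoked", "prompt_id": "customer-support.intent-classifier", '
    '"prompt_version": 2, "prompt_model": "claude-3-5-sonnet"}',
    "START RequestId: abc Version: $LATEST",  # non-JSON platform line
]


def cost_by_prompt(lines: list[str]) -> dict[tuple[str, int], float]:
    """Sum tokens per (prompt_id, prompt_version) and convert to USD."""
    totals: dict[tuple[str, int], int] = defaultdict(int)
    models: dict[tuple[str, int], str] = {}
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured events
        if not isinstance(event, dict) or event.get("event") != "prompt_completed":
            continue
        key = (event["prompt_id"], event["prompt_version"])
        totals[key] += event.get("tokens", 0)
        models[key] = event["prompt_model"]
    return {
        key: round(tokens / 1000 * USD_PER_1K_TOKENS[models[key]], 6)
        for key, tokens in totals.items()
    }


costs = cost_by_prompt(log_lines)
print(costs)
```

No regex is involved: because every event carries prompt_id and prompt_version as fields, grouping is a dictionary key.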

The discipline is: never call a prompt without logging the prompt id and version with it. The helper makes that the path of least resistance — drop in two lines, every call is traceable forever.
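
One way to make the logging hard to forget is a small wrapper around the call site. The sketch below is an assumption layered on top of the helper: traced_prompt is a hypothetical context manager, not part of promptops_core as shown. It repeats log_prompt_event and uses a SimpleNamespace stand-in for Prompt so it runs on its own, and it also exercises EVENT_FAILED, which the examples above define but never emit.

```python
import contextlib
import io
import json
import sys
import time
from types import SimpleNamespace

EVENT_INVOKED = "prompt_invoked"
EVENT_COMPLETED = "prompt_completed"
EVENT_FAILED = "prompt_failed"


def log_prompt_event(event_type, prompt, **extra):
    # Same shape as the helper above, repeated so this sketch runs standalone.
    payload = {
        "event": event_type,
        "prompt_id": prompt.id,
        "prompt_version": prompt.version,
        "prompt_owner": prompt.owner,
        "prompt_model": prompt.model,
    }
    payload.update(extra)
    print(json.dumps(payload), file=sys.stdout)


@contextlib.contextmanager
def traced_prompt(prompt, **context):
    """Hypothetical wrapper: emit invoked, then completed or failed."""
    log_prompt_event(EVENT_INVOKED, prompt, **context)
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        latency_ms = int((time.perf_counter() - start) * 1000)
        log_prompt_event(EVENT_FAILED, prompt, latency_ms=latency_ms,
                         error=type(exc).__name__, **context)
        raise  # never swallow the failure, only record it
    latency_ms = int((time.perf_counter() - start) * 1000)
    log_prompt_event(EVENT_COMPLETED, prompt, latency_ms=latency_ms, **context)


# Stand-in for a loaded Prompt; only the four logged attributes are needed.
prompt = SimpleNamespace(id="customer-support.intent-classifier", version=2,
                         owner="data-platform-team", model="claude-3-5-sonnet")

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    with traced_prompt(prompt, request_id="local"):
        pass  # the model call would go here
    try:
        with traced_prompt(prompt, request_id="local"):
            raise TimeoutError("model timed out")
    except TimeoutError:
        pass

events = [json.loads(line)["event"] for line in buffer.getvalue().splitlines()]
```

The trade-off is that per-call extras such as the classified intent have to be logged separately or passed in another way, so explicit calls as in the Lambda example remain a reasonable choice.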

6. Deploy — one command, two services

Deployment uses Serverless Compose, the same orchestration pattern as the serverless data platform piece:

# serverless-compose.yml
services:
  promptops-core:
    path: ./promptops_core
  promptops-examples:
    path: ./examples
    dependsOn: promptops-core

promptops-core runs first. Its serverless.yml:

Calls python ../scripts/build_artifacts.py (build hook, before package)
Publishes the Lambda Layer from dist/promptops_layer.zip
Uploads dist/promptops_glue.zip to S3 (after deploy)
Exports the Layer ARN via CloudFormation outputs

promptops-examples runs second. Its serverless.yml:

Uploads the Glue script to S3 (before deploy)
Defines the hello-world Lambda, attached to the Layer ARN imported from promptops-core
Defines the Glue job, referencing the Glue ZIP via --extra-py-files

A single command deploys everything, in dependency order:

npm run deploy:dev

What this gives you:

One source-of-truth pipeline. Prompt YAML changes → CI runs both gates → if both pass, merge → deploy bumps the Lambda Layer version → all consumers pick up the new version on their next deploy.
Versioning is deploy-time, not runtime. A Lambda function is pinned to a specific Layer version. Rolling back the deploy rolls back the prompt — the bug class from the previous post (rolling back code without rolling back the prompt) goes away.
Same artifact serves Lambda and Glue. Update the prompt once, both runtimes get it. No drift between “the Lambda version” and “the Glue version” of a prompt.

What I’ve stopped doing in PromptOps repos

Five patterns I’ve stopped tolerating, one for each of the gaps the discipline closes:

Skipping the schema validator on small repos. Even with one prompt, the validator runs in 100ms and prevents the future bug where someone forgets owner and ships a 50-line YAML that nobody can support.
Deferring eval until “we have a real model.” Stub it now with anything deterministic. The eval scaffold matters more than the eval contents: once teams see CI fail on accuracy, they care about accuracy.
Inline prompts in handler code. I’d rather see a missing import than a string literal in a Lambda. Make the registry the only path.
One repo without a clear blast radius. If multiple teams own prompts, multiple repos. If one team, one repo. Mixing them creates cross-team approval bottlenecks no PromptOps tool can solve.
Treating observability as optional. If log_prompt_event isn’t called, the entire stack is invisible at runtime. Make it a code-review requirement on the consumer side.

What’s left

This is a working slice. Phase 4 — coming in the next post — adds:

Drift detection — read live invocation events from CloudWatch, compare distributions (cost, latency, output mix) against baselines from prompt metadata, alert on regressions
Behavioral eval against real models — replace the stub classifier with a Bedrock or Anthropic client, generate synthetic test cases per prompt, track accuracy across versions
CloudWatch dashboards — pre-built dashboards consuming the structured events, queryable by prompt_id + prompt_version
Multi-tenant prompt registries — when one team’s prompts shouldn’t be visible or editable by another, the registry needs scopes

What I’m exploring next: agent integration. When a prompt is composed and called by an autonomous agent rather than a deterministic Lambda, the observability discipline matters more, not less. The volume goes up two orders of magnitude. The same event schema scales. The same versioning discipline scales. The platform is the foundation; agents are what sit on top.

Build the platform first. Then make it smart.

Code: github.com/raguvindtharanitharan/promptops

Previous in this series:

Structuring a Serverless Data Platform on AWS — the architectural pattern this builds on
PromptOps: The DataOps Playbook for Prompts — the argument this implements