The DevOps Delivery Pipeline: End-to-End Framework

A DevOps framework is not a single tool or process — it is an end-to-end system that connects your product idea to production software, with quality and security built in at every step.


Why You Need a Framework, Not Just Tools

Most teams adopt DevOps by collecting tools: "We use Jenkins for CI, Docker for containers, and Kubernetes for deployment." This is tool-first thinking, and it leads to fragmented pipelines where nobody really knows how a commit becomes a production release.

A framework is different. It defines:

  • Stages — what happens at each phase of the journey
  • Gates — what must be true before moving to the next stage
  • Feedback loops — how information flows back from later stages to earlier ones
  • Ownership — who is responsible for each stage
Idea → Plan → Code → Build → Test → Release → Deploy → Operate → Monitor
  ↑__________________feedback____________________________________|

The feedback loop is the most important part. A DevOps framework without feedback is just a waterfall with Docker.
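
As a rough illustration of how stages, gates, ownership, and feedback fit together, the sketch below models a pipeline as an ordered list of stages, each guarded by a gate predicate. Everything here (`Stage`, `run_pipeline`, the context keys, and the thresholds) is hypothetical and chosen only for the example; a real pipeline would express the same idea in its CI system's configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Context = Dict[str, float]


@dataclass
class Stage:
    name: str
    owner: str                        # who is responsible for this stage
    gate: Callable[[Context], bool]   # must be true before moving on


def run_pipeline(stages: List[Stage], context: Context) -> Optional[str]:
    """Return the name of the first stage whose gate fails, or None if all pass."""
    for stage in stages:
        if not stage.gate(context):
            # Feedback loop: the failure is reported back to the owning team
            print(f"Gate failed at {stage.name}; notifying {stage.owner}")
            return stage.name
    return None


# Hypothetical gates and thresholds, for illustration only
stages = [
    Stage("Build", "product-team", lambda ctx: ctx["build_ok"] == 1),
    Stage("Test", "product-team", lambda ctx: ctx["coverage"] >= 0.80),
    Stage("Release", "product-team", lambda ctx: ctx["critical_vulns"] == 0),
]

blocked_at = run_pipeline(
    stages, {"build_ok": 1, "coverage": 0.72, "critical_vulns": 0}
)
```

With the sample context, the pipeline stops at the Test stage because coverage is below the assumed 80% gate.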


Stage 1: Team Structure and Product Discovery

The Myth of the "DevOps Team"

One of the most common mistakes is creating a dedicated DevOps team that acts as a gatekeeper. This recreates the wall of confusion with a new label. Instead, DevOps principles work best with cross-functional product teams that own the full lifecycle of their services.

A healthy product team looks like this:

Role | Responsibility | Anti-Pattern
Product Owner | Define what to build and why | Writing tickets without talking to users
Designer (UX/UI) | Understand users, design solutions | Designing in a silo, handing off to devs
Backend Engineer | Build services, APIs, data layers | "It works on my machine"
Frontend Engineer | Build user interfaces | Not caring about API contracts
Security Engineer | Embed security from day one | Reviewing only at the end
Platform/SRE | Build developer tooling, maintain reliability | Owning Kubernetes, not enabling teams
QA Engineer | Verify quality at every stage | Writing tests only after features ship

The Why–What–How Model

Before writing a single line of code, teams should align on three levels:

WHY   → Business problem / user need / strategic goal
         "We lose 20% of users at checkout because the payment form is too complex"

WHAT  → The solution space / product requirement
         "A one-click payment flow using saved payment methods"

HOW   → Technical implementation
         "Add a /payments/quick-checkout endpoint that uses tokenized cards from Stripe"

Skipping the WHY leads to teams that build the wrong thing extremely efficiently.

Discovery Tooling

Purpose | Tools
Portfolio / Epic tracking | Jira, Linear, Shortcut
Documentation & wikis | Confluence, Notion, Backstage
Diagrams & architecture | Miro, draw.io, Excalidraw
Security architecture | Threat Dragon, OWASP Threat Modeling
API design | Stoplight, Swagger Editor, Redocly

Stage 2: Source Code Management

Everything as Code

Modern DevOps treats everything as code — not just application logic, but also infrastructure, tests, security policies, and documentation. This is the foundation of reproducibility.

repository/
├── src/                  # Application code
│   ├── api/
│   ├── services/
│   └── models/
├── tests/                # All test types
│   ├── unit/
│   ├── integration/
│   ├── e2e/
│   └── performance/
├── infra/                # Infrastructure as Code
│   ├── terraform/
│   └── helm/
├── .github/workflows/    # CI/CD pipelines
├── .pre-commit-config.yaml
├── Dockerfile
└── docker-compose.yml

Branching Strategies

Choose a branching strategy based on your team's release cadence:

Trunk-Based Development

main (trunk)
  ├── feature/add-payment (short-lived, < 2 days)
  ├── feature/user-profile (short-lived, < 2 days)
  └── tags: v1.2.0, v1.2.1, v1.3.0

  • All developers integrate to main daily
  • Feature flags hide incomplete work
  • Best for teams with strong CI and high test coverage
  • Reduces merge conflicts dramatically

GitHub Flow

main
│
├── feature/checkout-redesign
│   └── PR → code review → merge
└── fix/payment-timeout
    └── PR → code review → merge

  • Simple: main is always deployable
  • Every change goes through a PR
  • Good for web teams with continuous deployment

GitFlow

main ─────────────────────────────────────────
                ↑ merge at release
release/1.2 ───────────────
            ↑ cut
develop ─────────────────────────────────────
          ↑ merge feature branches
feature/* ────────
hotfix/*  ──── (cherry-pick to main and develop)

  • Useful for products with scheduled releases
  • Higher ceremony, more merge conflicts
  • Avoid if you can deploy frequently

Code Quality Gates (Pre-Commit)

Stop bad code before it hits the repository:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key        # Catch secrets early
      - id: check-added-large-files
        args: ['--maxkb=500']

  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.0
    hooks:
      - id: ruff
        args: [--fix]

  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']

  - repo: https://github.com/python-jsonschema/check-jsonschema
    rev: 0.28.0
    hooks:
      - id: check-github-workflows

Static Analysis Tools

Tool | What It Finds | When to Use
SonarQube | Code smells, bugs, coverage, duplication | CI pipeline, PR gate
Snyk | Open-source vulnerabilities (SCA) | CI pipeline, nightly
Checkmarx | SAST (security vulnerabilities in code) | CI pipeline
Semgrep | Custom security rules, fast SAST | Pre-commit, CI
Veracode | Enterprise SAST/DAST with compliance reports | Regulated industries
Black Duck | License compliance + OSS vulnerabilities | Legal/compliance gate

Stage 3: The Build Phase

What Happens in a Build

A build transforms source code into deployable artifacts. It must be:

  • Reproducible — same inputs always produce same outputs
  • Fast — slow builds break flow
  • Isolated — no dependency on developer's local machine
# Example: pyproject.toml with pinned dependencies (reproducible builds)
[tool.poetry]
name = "payment-service"
version = "0.1.0"

[tool.poetry.dependencies]
python = "^3.12"
fastapi = "0.110.0"
sqlalchemy = "2.0.28"
pydantic = "2.6.3"

[tool.poetry.group.dev.dependencies]
pytest = "7.4.4"
pytest-cov = "4.1.0"
black = "24.3.0"

Build Tool Selection

Ecosystem | Build Tool | Package Registry
Python | pip, Poetry, uv | PyPI, private Nexus/JFrog
Java | Maven, Gradle | Maven Central, Artifactory
JavaScript/TS | npm, pnpm, yarn | npmjs, Verdaccio
Go | go build | pkg.go.dev, GOPROXY
.NET | MSBuild, dotnet | NuGet, Artifactory

Testing During Build: The Testing Pyramid

        /\
       /  \
      / E2E \         ← Few, slow, expensive (Selenium, Playwright)
     /──────\
    /  Integ  \       ← Some (API tests, DB tests, contract tests)
   /────────────\
  /   Unit Tests  \   ← Many, fast, cheap (pytest, JUnit)
 /──────────────────\
/  Static Analysis   \ ← All code, always (SonarQube, Semgrep)

Unit test example (Python/pytest):

# tests/unit/test_payment_service.py
import pytest
from decimal import Decimal
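from services.payment import PaymentService, InsufficientFundsError
from tests.fakes import MockGateway  # assumed location of the gateway test double; adjust to your project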

class TestPaymentService:
    def test_process_payment_success(self):
        service = PaymentService(gateway=MockGateway())
        result = service.process(amount=Decimal("99.99"), currency="USD")
        assert result.status == "approved"
        assert result.transaction_id is not None

    def test_process_payment_insufficient_funds(self):
        gateway = MockGateway(decline_reason="insufficient_funds")
        service = PaymentService(gateway=gateway)

        with pytest.raises(InsufficientFundsError):
            service.process(amount=Decimal("999.99"), currency="USD")

    def test_process_payment_invalid_amount(self):
        service = PaymentService(gateway=MockGateway())
        with pytest.raises(ValueError, match="Amount must be positive"):
            service.process(amount=Decimal("-10.00"), currency="USD")

Artifact Storage Strategy

Never rebuild the same code twice. Store built artifacts in a registry:

Build → Artifact Registry (Nexus / JFrog Artifactory / GitHub Packages)
           ├── Python wheels (.whl)
           ├── Java JARs
           ├── npm packages
           └── Container images (see Stage 4)

Why artifact registries matter:

  1. Build once, deploy many times — same binary to dev/staging/prod
  2. Rollbacks are instant — the old artifact is still there
  3. Security scanning happens once, not per environment
  4. License compliance is tracked centrally
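
One way to make "build once, deploy many times" checkable is to record a content digest when the artifact is built and compare it before each promotion. The sketch below is a minimal illustration using only the standard library; the function names are hypothetical, and in practice a registry or image digest usually plays this role.

```python
import hashlib
from pathlib import Path


def artifact_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a built artifact, read in chunks."""
    sha = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(64 * 1024), b""):
            sha.update(chunk)
    return sha.hexdigest()


def verify_promotion(path: Path, expected_digest: str) -> bool:
    """True if the artifact about to be deployed is byte-identical to the one built."""
    return artifact_digest(path) == expected_digest
```

A deploy step could call `verify_promotion` with the digest recorded at build time and refuse to continue if it returns `False`.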

Stage 4: Containerization and Image Security

The Container Build Pipeline

Building a container image is more than running docker build. A production-grade image pipeline looks like this:

Source Code
    ↓
Dockerfile / Buildpacks
    ↓
Build → OCI Image
    ↓
SBOM Generation (CycloneDX / SPDX)
    ↓
SCA Scan (Trivy / Snyk / Black Duck)
    ├── PASS → Push to Registry
    └── FAIL → Block pipeline, notify team
    ↓
Container Registry (Harbor / ECR / Docker Hub / JFrog)
    ↓
Image Signing (Cosign / Notary)

Writing Secure Dockerfiles

# Anti-pattern: avoid Dockerfiles like this one
FROM python:latest           # unpinned = non-reproducible

WORKDIR /app
COPY . .
RUN pip install -r requirements.txt

RUN apt-get install -y curl vim wget  # unnecessary tools

CMD ["python", "app.py"]

# Problems: runs as root, unpinned base, installs debug tools

# Improved version: multi-stage build, pinned base image, non-root user
# Stage 1: Build dependencies
FROM python:3.12.3-slim AS builder

WORKDIR /build
COPY pyproject.toml poetry.lock ./

RUN pip install --no-cache-dir poetry==1.8.2 && \
    poetry export -f requirements.txt --output requirements.txt --without-hashes

# Stage 2: Runtime image
FROM python:3.12.3-slim AS runtime

# Create non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Install only runtime dependencies
COPY --from=builder /build/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && \
    rm -rf /root/.cache

COPY --chown=appuser:appuser src/ ./src/

USER appuser

EXPOSE 8080
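# The slim base image does not ship curl, so probe with the Python standard library instead
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)" || exit 1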

CMD ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]

SBOM: Software Bill of Materials

An SBOM is an ingredient list for your software. It tells you exactly what open-source components are inside your container image — critical for vulnerability response and license compliance.

# Generate SBOM with Syft
syft payment-service:v1.2.0 -o cyclonedx-json > sbom.json

# Scan SBOM for vulnerabilities with Grype
grype sbom:./sbom.json --fail-on high

# Scan with Trivy; --exit-code 1 makes the command fail when findings match the severity filter
trivy image --severity HIGH,CRITICAL --exit-code 1 payment-service:v1.2.0

# Sign the image after scanning passes
cosign sign --key cosign.key payment-service:v1.2.0

Sample Trivy output interpretation:

payment-service:v1.2.0 (debian 12.5)

Total: 3 (HIGH: 2, CRITICAL: 1)

┌──────────────┬────────────────┬──────────┬───────────────────┬──────────────────────┐
│   Library    │ Vulnerability  │ Severity │ Installed Version │    Fixed Version     │
├──────────────┼────────────────┼──────────┼───────────────────┼──────────────────────┤
│ libssl3      │ CVE-2024-XXXX  │ CRITICAL │ 3.0.11-1          │ 3.0.13-1             │
│ cryptography │ CVE-2024-YYYY  │ HIGH     │ 42.0.0            │ 42.0.5               │
│ requests     │ CVE-2024-ZZZZ  │ HIGH     │ 2.31.0            │ 2.32.0               │
└──────────────┴────────────────┴──────────┴───────────────────┴──────────────────────┘
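
To turn a scan like this into the PASS/FAIL gate from the pipeline diagram, a CI step can read the scanner's machine-readable report and block on selected severities. The sketch below assumes a report shaped like Trivy's JSON output (`Results` → `Vulnerabilities` → `Severity`); the function name is hypothetical, and the sample data reuses the placeholder CVE IDs from the table above plus one made-up LOW finding for contrast.

```python
from typing import Dict, List, Set

BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def blocking_findings(report: Dict, blocking: Set[str] = BLOCKING_SEVERITIES) -> List[str]:
    """Collect a one-line summary for every finding whose severity blocks the pipeline."""
    findings = []
    for result in report.get("Results", []):
        # Targets without findings may carry no "Vulnerabilities" key or a null value
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                findings.append(
                    f'{vuln["PkgName"]}: {vuln["VulnerabilityID"]} ({vuln["Severity"]})'
                )
    return findings


# Illustrative report; IDs are placeholders, not real CVEs
sample_report = {
    "Results": [
        {
            "Target": "payment-service:v1.2.0 (debian 12.5)",
            "Vulnerabilities": [
                {"PkgName": "libssl3", "VulnerabilityID": "CVE-2024-XXXX", "Severity": "CRITICAL"},
                {"PkgName": "example-lib", "VulnerabilityID": "CVE-0000-0000", "Severity": "LOW"},
            ],
        },
        {
            "Target": "Python",
            "Vulnerabilities": [
                {"PkgName": "cryptography", "VulnerabilityID": "CVE-2024-YYYY", "Severity": "HIGH"},
            ],
        },
        {"Target": "Node.js", "Vulnerabilities": None},
    ]
}

findings = blocking_findings(sample_report)
should_block = bool(findings)
```

Keeping the blocking set as a parameter lets a team start with CRITICAL only and tighten the gate later.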

Container Registry Strategy

Registry | Best For | Key Feature
Harbor | Self-hosted, air-gapped | RBAC, built-in scanning, replication
Amazon ECR | AWS-native teams | IAM integration, lifecycle policies
Google Artifact Registry | GCP-native teams | Multi-format (Docker, Maven, npm)
Docker Hub | Open source / small teams | Largest public registry
JFrog Container Registry | Enterprise, multi-cloud | Universal, advanced security

Stage 5: Release Management

Semantic Versioning

Every release must have a clear, machine-readable version number:

MAJOR.MINOR.PATCH[-PRERELEASE][+BUILD]

v1.2.3
│ │ └── PATCH: backward-compatible bug fixes
│ └──── MINOR: backward-compatible new features
└────── MAJOR: breaking changes

Examples:

  • v1.0.0 → Initial stable release
  • v1.1.0 → New feature added (backward compatible)
  • v1.1.1 → Bug fix
  • v2.0.0 → Breaking API change
  • v1.2.0-rc.1 → Release candidate
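
Because the format is machine-readable, tooling can parse it and reason about upgrades. The sketch below is a simplified parser for the `MAJOR.MINOR.PATCH[-PRERELEASE][+BUILD]` shape shown above, with an optional leading `v`; the names are hypothetical, and it deliberately does not implement the full SemVer precedence rules for pre-release identifiers.

```python
import re
from typing import NamedTuple, Optional

SEMVER_RE = re.compile(
    r"^v?(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
    r"(?:-(?P<prerelease>[0-9A-Za-z.-]+))?"
    r"(?:\+(?P<build>[0-9A-Za-z.-]+))?$"
)


class Version(NamedTuple):
    major: int
    minor: int
    patch: int
    prerelease: Optional[str]
    build: Optional[str]


def parse_version(text: str) -> Version:
    """Parse a version string such as 'v1.2.0-rc.1' into its components."""
    match = SEMVER_RE.match(text)
    if match is None:
        raise ValueError(f"Not a semantic version: {text!r}")
    return Version(
        int(match["major"]),
        int(match["minor"]),
        int(match["patch"]),
        match["prerelease"],
        match["build"],
    )


def is_breaking_upgrade(old: str, new: str) -> bool:
    """A MAJOR bump signals breaking changes; MINOR and PATCH bumps do not."""
    return parse_version(new).major > parse_version(old).major
```

A deployment tool could use `is_breaking_upgrade` to require manual approval before rolling out a new major version.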

Automated Release with Conventional Commits

Use Conventional Commits to automate version bumping and changelog generation:

feat: add one-click payment support     → bumps MINOR (1.2.0 → 1.3.0)
fix: correct tax calculation rounding   → bumps PATCH  (1.3.0 → 1.3.1)
feat!: redesign checkout API            → bumps MAJOR  (1.3.1 → 2.0.0)
docs: update API reference              → no version bump
chore: upgrade dependencies             → no version bump
# .releaserc.yml — semantic-release configuration
branches:
  - main
  - name: beta
    prerelease: true

plugins:
  - "@semantic-release/commit-analyzer"
  - "@semantic-release/release-notes-generator"
  - "@semantic-release/changelog"
  - "@semantic-release/git"
  - "@semantic-release/github"

verifyConditions:
  - "@semantic-release/github"

prepare:
  - "@semantic-release/changelog"
  - "@semantic-release/git"

publish:
  - "@semantic-release/github"
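
To make the commit-to-version mapping concrete, here is a simplified sketch of the decision a release tool makes from a batch of commit messages. It mirrors only the rules listed above (`feat` → MINOR, `fix` → PATCH, `!` or a `BREAKING CHANGE:` footer → MAJOR); the function names are hypothetical, and real analyzers such as semantic-release support more commit types and configuration.

```python
import re
from typing import Iterable, Tuple

# type, optional (scope), optional "!" marker, then ": "
HEADER_RE = re.compile(r"^(?P<type>[a-z]+)(?:\([^)]*\))?(?P<breaking>!)?: ")
LEVELS = ["none", "patch", "minor", "major"]


def bump_for(messages: Iterable[str]) -> str:
    """Return the highest bump ('major', 'minor', 'patch', 'none') implied by the commits."""
    level = "none"
    for message in messages:
        header = message.split("\n", 1)[0]
        match = HEADER_RE.match(header)
        if match is None:
            continue  # not a Conventional Commit; ignored in this sketch
        if match["breaking"] or "BREAKING CHANGE:" in message:
            return "major"
        candidate = {"feat": "minor", "fix": "patch"}.get(match["type"], "none")
        if LEVELS.index(candidate) > LEVELS.index(level):
            level = candidate
    return level


def next_version(current: Tuple[int, int, int], bump: str) -> Tuple[int, int, int]:
    """Apply a bump level to a (major, minor, patch) tuple."""
    major, minor, patch = current
    if bump == "major":
        return (major + 1, 0, 0)
    if bump == "minor":
        return (major, minor + 1, 0)
    if bump == "patch":
        return (major, minor, patch + 1)
    return current
```

For example, `next_version((1, 2, 0), bump_for(["feat: add one-click payment support"]))` yields `(1, 3, 0)`, matching the first row of the mapping above.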

Release Tracking

Every release should be documented with:

## Release v1.3.0 — 2026-05-18

### Changes
- feat: one-click payment with saved cards (#234)
- feat: support Apple Pay and Google Pay (#241)
- fix: correct tax calculation for EU customers (#238)

### Security
- Updated cryptography from 42.0.0 to 42.0.5 (CVE-2024-YYYY)

### Breaking Changes
None

### Deployment Notes
- Run migration: `alembic upgrade head`
- Set new env var: `PAYMENT_TOKENIZATION_KEY`

### Metrics Baseline (before deployment)
- Error rate: 0.02%
- p99 latency: 180ms
- Deployment frequency: 3x/week

Stage 6: Comprehensive Testing

The Testing Spectrum

Testing isn't just about unit tests. A mature testing strategy covers the full spectrum from fast developer-local tests to slow production validation:

← Faster / Cheaper / More Isolated          Slower / More Real →

Unit → Integration → Contract → E2E → Performance → Chaos → Production

Test Types Deep Dive

Unit Tests

Test individual functions/classes in isolation. Mock all external dependencies.

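# Fast, no I/O, no network
from decimal import Decimal

from models.order import Coupon, Order, OrderItem  # assumed module path; adjust to your project
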
def test_order_total_with_discount():
    order = Order(
        items=[
            OrderItem(product_id="P1", quantity=2, unit_price=Decimal("50.00")),
            OrderItem(product_id="P2", quantity=1, unit_price=Decimal("30.00")),
        ]
    )
    coupon = Coupon(code="SAVE20", discount_percent=20)

    total = order.calculate_total(coupon=coupon)

    assert total == Decimal("104.00")  # (100 + 30) * 0.8

Target: > 80% code coverage. Run time: < 60 seconds for entire suite.

Integration Tests

Test how components work together. Use real databases (in Docker), real message queues.

# tests/integration/test_order_repository.py
import pytest
from sqlalchemy import create_engine
from repositories.order_repository import SQLOrderRepository

@pytest.fixture(scope="session")
def test_db():
    engine = create_engine("postgresql://test:test@localhost:5433/testdb")
    # Run migrations
    run_migrations(engine)
    yield engine
    engine.dispose()

def test_save_and_retrieve_order(test_db):
    repo = SQLOrderRepository(test_db)
    order = Order.create(customer_id="C001", items=[...])

    repo.save(order)
    retrieved = repo.find_by_id(order.id)

    assert retrieved.id == order.id
    assert len(retrieved.items) == len(order.items)

Contract Tests

Verify that a consumer and provider agree on an API contract, without requiring both to be live simultaneously.

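# Consumer side (using pact-python's Consumer/Provider API)
import atexit

from pact import Consumer, EachLike, Provider

pact = Consumer("payment-service").has_pact_with(
    Provider("user-service"),
    host_name="localhost",
    port=1234
)

# Start the local mock provider and stop it when the test session ends
pact.start_service()
atexit.register(pact.stop_service)

def test_get_user_payment_methods():
    (pact
     .given("User U001 has two saved payment methods")
     .upon_receiving("a request for payment methods")
     .with_request("GET", "/users/U001/payment-methods")
     .will_respond_with(200, body={
         "user_id": "U001",
         # EachLike matches an array with at least one element of this shape
         "methods": EachLike({
             "id": "PM001",
             "type": "credit_card",
             "last_four": "4242"
         })
     }))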

    with pact:
        result = get_user_payment_methods("U001")
        assert len(result) >= 1

End-to-End (E2E) Tests

Test the full user journey through real UI or API endpoints.

# Using Playwright (Python)
from playwright.sync_api import Page

def test_complete_checkout_flow(page: Page):
    # Navigate to product
    page.goto("https://staging.myapp.com/products/laptop-pro")
    page.click("[data-testid='add-to-cart']")

    # Go to checkout
    page.click("[data-testid='checkout-button']")
    page.fill("[name='email']", "test@example.com")

    # Payment
    page.fill("[name='card-number']", "4242 4242 4242 4242")
    page.fill("[name='expiry']", "12/28")
    page.fill("[name='cvv']", "123")
    page.click("[data-testid='place-order']")

    # Verify confirmation
    page.wait_for_selector("[data-testid='order-confirmation']")
    assert "Order confirmed" in page.inner_text("h1")

Performance Tests

Verify the system behaves correctly under load.

# locustfile.py
from locust import HttpUser, task, between

class CheckoutUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def browse_products(self):
        self.client.get("/api/products?category=electronics")

    @task(2)
    def view_product(self):
        self.client.get("/api/products/laptop-pro")

    @task(1)
    def add_to_cart(self):
        self.client.post("/api/cart/items", json={
            "product_id": "laptop-pro",
            "quantity": 1
        })

Run: locust -f locustfile.py --headless --host https://staging.myapp.com --users 1000 --spawn-rate 100 --run-time 5m

Performance SLOs to test against:

  • p50 response time < 100ms
  • p99 response time < 500ms
  • Error rate < 0.1% at peak load
  • Throughput > 1,000 RPS
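
These targets are easiest to enforce when the load-test results are checked automatically. The sketch below is a minimal illustration of such a check, assuming you can export one latency sample per request plus an error count and the test duration; the function name and result keys are hypothetical, and the thresholds are the SLOs listed above.

```python
from statistics import quantiles
from typing import Dict, List


def check_slos(latencies_ms: List[float], errors: int, duration_s: float) -> Dict[str, bool]:
    """Compare load-test results against the four SLO targets."""
    total = len(latencies_ms)
    # n=100 yields 99 cut points: index 49 is the median, index 98 the 99th percentile
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    return {
        "p50_under_100ms": p50 < 100,
        "p99_under_500ms": p99 < 500,
        "error_rate_under_0.1pct": errors / total < 0.001,
        "throughput_over_1000_rps": total / duration_s > 1000,
    }


# Synthetic sample: 10,000 requests in 8 seconds, 5 of them failed
sample = [50.0] * 9900 + [400.0] * 100
report = check_slos(sample, errors=5, duration_s=8.0)
pipeline_should_pass = all(report.values())
```

Returning one boolean per SLO, rather than a single pass/fail, makes it clear in the CI log which target was missed.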

Security Testing: DAST and IAST

DAST (Dynamic Application Security Testing) attacks your running application from the outside:

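# OWASP ZAP baseline scan against staging
# The mounted volume lets ZAP write the report into the current directory
docker run --rm -v "$(pwd):/zap/wrk/:rw" -t ghcr.io/zaproxy/zaproxy:stable zap-baseline.py \
    -t https://staging.myapp.com \
    -r zap-report.html

# zap-baseline.py exits non-zero when a rule fails or warns, which fails the pipeline step.
# The baseline scan is passive; zap-full-scan.py adds active attacks.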

IAST (Interactive Application Security Testing) instruments your application from within to detect vulnerabilities as tests run:

  • Contrast Security
  • Seeker by Synopsys
  • HCL AppScan IAST

QA Tool Landscape

Category | Tools | Use Case
Browser automation | Selenium, Playwright, Cypress | E2E web testing
Mobile testing | Appium, Detox | iOS/Android automation
Visual regression | Applitools, Percy | Catch UI regressions
API testing | Postman, REST Assured, httpx | API contract and functional
Performance | JMeter, Locust, k6, BlazeMeter | Load and stress testing
Service virtualization | WireMock, Mockoon | Mock external dependencies
Cross-browser cloud | Sauce Labs, BrowserStack | Multi-browser/device testing

Stage 7: Deployment Strategies

Choosing the Right Deployment Strategy

Not every change needs the same deployment strategy. Match the strategy to the risk:

Strategy | Risk | Speed | Rollback | When to Use
Recreate | High | Fast | Slow | Dev environments only
Rolling Update | Medium | Medium | Medium | Default for low-traffic services
Blue-Green | Low | Fast | Instant | Critical services, DB migrations
Canary | Very Low | Slow | Instant | New features, uncertain impact
A/B Testing | Very Low | Slowest | Instant | Feature experiments
Shadow | None | Slowest | N/A | Testing ML models in production

Blue-Green Deployment

Two identical environments. Traffic switches instantly. Zero downtime.

                   ┌──────────────────┐
                   │   Load Balancer  │
                   └────────┬─────────┘
                            │ 100% traffic
                     ┌──────▼──────┐
                     │  Blue (v1)  │  ← Currently live
                     └─────────────┘

[Deploy v2 to Green environment]

                   ┌──────────────────┐
                   │   Load Balancer  │
                   └────────┬─────────┘
                            │ 100% traffic
                     ┌──────▼──────┐
                     │  Green (v2) │  ← Switch over (instant)
                     └─────────────┘
                     ┌─────────────┐
                     │  Blue (v1)  │  ← Keep alive for rollback
                     └─────────────┘

Kubernetes Blue-Green with Argo Rollouts:

# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  strategy:
    blueGreen:
      activeService: payment-service-active
      previewService: payment-service-preview
      autoPromotionEnabled: false       # Manual promotion required
      scaleDownDelaySeconds: 600        # Keep blue alive 10 min
      prePromotionAnalysis:
        templates:
        - templateName: error-rate-check
        args:
        - name: service-name
          value: payment-service-preview

Canary Deployment

Route a small percentage of traffic to the new version, then increase it gradually while metrics stay healthy.

v1 (stable) ──────── 95% ──────────────→ Users
v2 (canary) ────────  5% ──────────────→ Users

[Monitor: error rate, latency, business metrics]

v1 (stable) ──────── 70% ──────────────→ Users
v2 (canary) ──────── 30% ──────────────→ Users

[All good → promote]

v2 (stable) ──────── 100% ─────────────→ Users
# Kubernetes Canary with Argo Rollouts
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5          # 5% traffic to canary
      - pause: {duration: 5m} # Wait 5 minutes
      - analysis:             # Automated analysis
          templates:
          - templateName: canary-analysis
      - setWeight: 20         # 20% if analysis passes
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100        # Full promotion
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vsvc
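
The analysis step is where the promote-or-rollback decision is made. In Argo Rollouts that logic lives in an AnalysisTemplate backed by metric queries; the sketch below only illustrates the kind of comparison involved, in plain Python. The names and every threshold (`min_requests`, `max_error_ratio`, `max_p99_ratio`, `error_floor`) are assumptions made for the example, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int
    p99_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    stable: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,        # below this, the sample is too small to judge
    max_error_ratio: float = 1.5,   # canary may have at most 1.5x the stable error rate
    max_p99_ratio: float = 1.2,     # canary p99 may be at most 20% slower
    error_floor: float = 0.001,     # always tolerate up to 0.1% errors
) -> str:
    """Return 'wait', 'rollback', or 'promote' for one analysis window."""
    if canary.requests < min_requests:
        return "wait"
    allowed_error_rate = max(stable.error_rate * max_error_ratio, error_floor)
    if canary.error_rate > allowed_error_rate:
        return "rollback"
    if canary.p99_ms > stable.p99_ms * max_p99_ratio:
        return "rollback"
    return "promote"


stable = WindowStats(requests=19000, errors=4, p99_ms=180.0)
healthy_canary = WindowStats(requests=1000, errors=0, p99_ms=190.0)
verdict = canary_verdict(stable, healthy_canary)
```

Comparing the canary against the stable version over the same window, rather than against a fixed threshold alone, keeps the check meaningful when overall traffic conditions change.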

Feature Flags: Deployment ≠ Release

Separate the act of deploying code from the act of releasing a feature to users:

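# Using feature flags (LaunchDarkly / Unleash / Flagsmith)
# Written against the LaunchDarkly Python SDK v8+, where contexts replace user dicts
import ldclient
from ldclient import Context
from ldclient.config import Config

# Initialize the client once at startup, not on every request
ldclient.set_config(Config(settings.LAUNCH_DARKLY_KEY))
ld_client = ldclient.get()

def checkout(user_id: str, cart: Cart) -> CheckoutResult:
    context = Context.builder(user_id).set("email", get_user_email(user_id)).build()

    if ld_client.variation("one-click-checkout", context, False):
        # New one-click flow, only shown to flagged users
        return one_click_checkout(cart)
    else:
        # Original flow for everyone else
        return standard_checkout(cart)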

This lets you:

  • Deploy to production without activating the feature
  • Enable for internal users first (dogfooding)
  • Gradually roll out to 1% → 10% → 100% of users
  • Switch the feature off instantly (kill switch) if something goes wrong
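
A gradual rollout needs each user to land consistently on the same side of the flag as the percentage grows. A common generic technique is to hash the flag key and user ID into a stable bucket; the sketch below illustrates that idea with the standard library. It is not how any particular vendor implements it, and the function names are hypothetical.

```python
import hashlib


def rollout_bucket(flag_key: str, user_id: str) -> float:
    """Map a (flag, user) pair to a stable value in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a uniform number, then scale to a percentage
    return int(digest[:8], 16) / 0x100000000 * 100


def is_enabled(flag_key: str, user_id: str, rollout_percent: float) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(flag_key, user_id) < rollout_percent
```

Because the bucket depends only on the flag key and user ID, a user enabled at 10% stays enabled at 50%, and setting the percentage to 0 acts as the kill switch.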


Stage 8: Post-Deployment Verification

Smoke Tests in Production

Run a minimal set of critical-path tests immediately after deployment to verify the service is up:

# tests/smoke/test_payment_smoke.py
import os

import httpx

BASE_URL = os.getenv("TARGET_URL", "https://api.myapp.com")

def test_health_check():
    response = httpx.get(f"{BASE_URL}/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_payment_endpoint_responds():
    response = httpx.get(f"{BASE_URL}/api/v1/payments/methods")
    assert response.status_code in [200, 401]  # OK or needs auth, not 500

def test_critical_dependency_connectivity():
    response = httpx.get(f"{BASE_URL}/health/deep")
    health = response.json()
    assert health["database"] == "connected"
    assert health["cache"] == "connected"
    # Payment gateway may be external, check degraded mode
    assert health["payment_gateway"] in ["connected", "degraded"]

Infrastructure Vulnerability Scanning

After deployment, scan the infrastructure itself:

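# Tenable Nessus: scan deployed infrastructure
# NOTE: illustrative pseudo-command. Nessus scans are normally configured and launched
# through its web UI or REST API, so treat the flags below as placeholders.
nessus --target payment-service.prod.internal \
       --policy "PCI DSS Compliance" \
       --output-format pdf \
       --output-file infra-scan-report.pdf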

# Falco — real-time container runtime security
# Detects unusual activity in running containers
cat /etc/falco/falco_rules.yaml
# Falco rule: detect container trying to write to /etc
- rule: Write to /etc in container
  desc: Attempt to write to /etc directory in a container
  condition: >
    container and
    open_write and
    fd.name startswith /etc
  output: "File opened for writing under /etc (%user.name %proc.name %fd.name)"
  priority: ERROR

Chaos Engineering

Proactively break things in production to find weaknesses before they find you:

# chaos-monkey.py: simplified chaos test
import random

from kubernetes import client, config

def random_pod_kill(namespace: str, label_selector: str):
    """Kill a random pod to test resilience."""
    config.load_kube_config()
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod(
        namespace=namespace,
        label_selector=label_selector
    )

    if not pods.items:
        raise ValueError(f"No pods found matching {label_selector}")

    target = random.choice(pods.items)
    print(f"Killing pod: {target.metadata.name}")

    v1.delete_namespaced_pod(
        name=target.metadata.name,
        namespace=namespace
    )

    # System should recover automatically via Kubernetes

Chaos Engineering maturity levels:

  1. Level 1: Terminate a random pod — does Kubernetes restart it?
  2. Level 2: Kill an entire availability zone — does traffic failover?
  3. Level 3: Introduce latency on a critical dependency — does the circuit breaker trip?
  4. Level 4: Saturate CPU/memory on a node — does the HPA scale out?
  5. Level 5: Simulate a database failover — does the application recover?

Stage 9: Monitoring and Observability

The Three Pillars of Observability

Logging, metrics, and traces are distinct but complementary:

LOGS     → What happened? (events, errors, audit trail)
METRICS  → How is the system performing? (numbers over time)
TRACES   → Why is this request slow? (distributed request path)

Metrics: The RED Method

For every service, instrument these three metric types:

Metric | Prometheus Query | Alert Condition
Rate (RPS) | sum(rate(http_requests_total[5m])) | < 50% of baseline
Errors | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | > 1%
Duration | histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) | > 500ms

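# Instrumenting a FastAPI service
import time

from fastapi import FastAPI, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]  # "status" matches the label used in the queries and SLO
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # perf_counter is monotonic, so durations are not skewed by clock adjustments
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start

    # Use the route template (e.g. /users/{id}) when available to keep label
    # cardinality bounded; fall back to the raw path otherwise
    route = request.scope.get("route")
    endpoint = getattr(route, "path", request.url.path)

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=endpoint,
        status=str(response.status_code)
    ).inc()

    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(duration)

    return response

@app.get("/metrics")
def metrics():
    # Return the Prometheus text format with its content type, not a JSON-encoded string
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)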

Structured Logging

Unstructured logs are noise. Structured logs are searchable, filterable, and queryable:

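# Before: unstructured, hard to parse, hard to search
print(f"Processing payment for user {user_id}, amount {amount}")
print(f"Error: payment failed - {error}")

# After: structured events with key-value context
import structlog

# Render events as JSON with a level and an ISO timestamp
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()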

log.info(
    "payment.processing",
    user_id=user_id,
    amount=str(amount),
    currency=currency,
    payment_method_id=payment_method_id,
    trace_id=get_trace_id()
)

log.error(
    "payment.failed",
    user_id=user_id,
    error_code=error.code,
    error_message=str(error),
    payment_method_id=payment_method_id,
    trace_id=get_trace_id()
)

Output (JSON):

{
  "timestamp": "2026-05-18T10:23:41.123Z",
  "level": "error",
  "event": "payment.failed",
  "user_id": "U001",
  "error_code": "card_declined",
  "error_message": "Your card has insufficient funds",
  "payment_method_id": "PM005",
  "trace_id": "abc123def456"
}

Distributed Tracing with OpenTelemetry

Traces let you follow a single request across multiple services:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payment-service")

# Instrument code
async def process_payment(payment_request: PaymentRequest):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("user.id", payment_request.user_id)
        span.set_attribute("payment.amount", float(payment_request.amount))
        span.set_attribute("payment.currency", payment_request.currency)

        try:
            # Downstream calls appear as child spans only if those clients are instrumented
            # (for example through OpenTelemetry auto-instrumentation)
            user = await user_service.get_user(payment_request.user_id)
            result = await payment_gateway.charge(payment_request)

            span.set_attribute("payment.transaction_id", result.transaction_id)
            span.set_status(trace.StatusCode.OK)
            return result

        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise

SLO-Based Alerting

Don't alert on internal causes such as high CPU or a restarted pod. Alert on user impact, measured by how fast the error budget is burning:

# SLO definition (YAML for Pyrra / Sloth)
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: payment-service-availability
spec:
  target: "99.9"          # 99.9% success rate
  window: 30d             # Measured over 30 days
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="payment-service",status=~"5.."}
      total:
        metric: http_requests_total{job="payment-service"}

  # Alert when we're burning error budget too fast
  alerting:
    burnRateAlerts:
    - short: 5m
      long: 1h
      burnRate: 14.4      # 1h burn: page immediately
      severity: critical
    - short: 30m
      long: 6h
      burnRate: 6         # 6h burn: warn the team
      severity: warning
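
The burn-rate numbers are easier to reason about with the arithmetic spelled out. A burn rate of 1 means the error budget is consumed exactly over the SLO window; higher rates consume it proportionally faster. The sketch below works through the two alert tiers above for a 99.9% target over 30 days; the function names are hypothetical.

```python
def error_budget(slo_target_pct: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% target."""
    return 1 - slo_target_pct / 100


def budget_consumed(burn_rate: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget spent if burn_rate is sustained for the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


def error_rate_threshold(burn_rate: float, slo_target_pct: float) -> float:
    """Observed error rate that corresponds to a given burn rate."""
    return burn_rate * error_budget(slo_target_pct)


# Critical tier: burn rate 14.4 sustained for 1 hour
critical_share = budget_consumed(14.4, 1)   # 14.4 * 1 / 720
# Warning tier: burn rate 6 sustained for 6 hours
warning_share = budget_consumed(6, 6)       # 6 * 6 / 720
# Error rate that triggers the critical tier for a 99.9% SLO
critical_error_rate = error_rate_threshold(14.4, 99.9)
```

So the critical alert fires when about 2% of the 30-day budget would be gone within an hour (an error rate near 1.44%), and the warning fires at about 5% within six hours.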

Observability Platform Selection

Platform | Best For | Key Strength
Grafana + Prometheus + Loki + Tempo | Self-hosted, cost-conscious | Full LGTM stack, open source
Datadog | Enterprise, multi-cloud | Best all-in-one experience
Dynatrace | Large enterprises, AI-ops | Auto-instrumentation, Davis AI
New Relic | Full-stack observability | Generous free tier
Elastic Stack (ELK/EFK) | Log-heavy workloads | Powerful search and analytics
Splunk | Security + ops combined | SIEM capabilities built in
AppDynamics | Java/.NET enterprise | Deep APM, business metrics

Stage 10: Continuous Improvement and Feedback Loops

DORA Metrics: The North Star

Google's DORA research identified four metrics that predict high-performing engineering organizations:

Metric | Elite | High | Medium | Low
Deployment Frequency | Multiple/day | Daily–weekly | Weekly–monthly | < Monthly
Lead Time for Changes | < 1 hour | 1 day – 1 week | 1 week – 1 month | > 1 month
Change Failure Rate | < 5% | 5–10% | 10–15% | > 15%
MTTR | < 1 hour | < 1 day | < 1 week | > 1 month

Measuring DORA in practice:

# Simple DORA metric calculator
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List

@dataclass
class Deployment:
    deployed_at: datetime
    lead_time_hours: float  # commit to production
    failed: bool

@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime
    caused_by_deployment: bool

def calculate_dora_metrics(
    deployments: List[Deployment],
    incidents: List[Incident],
    window_days: int = 30
) -> dict:
    cutoff = datetime.now() - timedelta(days=window_days)
    recent_deploys = [d for d in deployments if d.deployed_at > cutoff]
    recent_incidents = [i for i in incidents if i.started_at > cutoff]

    # Deployment Frequency
    deploy_freq = len(recent_deploys) / window_days  # per day

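    # Lead time and change failure rate are undefined without deployments,
    # so report 0 for an empty window instead of dividing by zero.
    if recent_deploys:
        # Lead Time (average)
        avg_lead_time = sum(d.lead_time_hours for d in recent_deploys) / len(recent_deploys)

        # Change Failure Rate
        failed = [d for d in recent_deploys if d.failed]
        cfr = len(failed) / len(recent_deploys) * 100
    else:
        avg_lead_time = 0.0
        cfr = 0.0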

    # MTTR
    deployment_incidents = [i for i in recent_incidents if i.caused_by_deployment]
    if deployment_incidents:
        avg_mttr = sum(
            (i.resolved_at - i.started_at).total_seconds() / 3600
            for i in deployment_incidents
        ) / len(deployment_incidents)
    else:
        avg_mttr = 0

    return {
        "deployment_frequency_per_day": round(deploy_freq, 2),
        "avg_lead_time_hours": round(avg_lead_time, 1),
        "change_failure_rate_pct": round(cfr, 1),
        "mean_time_to_recover_hours": round(avg_mttr, 1),
    }
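
The calculator's output becomes more useful once it is mapped back onto the performance bands in the table. The sketch below does that for two of the four metrics; the function names are hypothetical, and the boundary handling (for example, exactly 5% counting as "High") is an assumption, since the table leaves the edges open.

```python
# Map two calculator outputs onto the performance bands from the DORA table.
# Boundary handling (e.g. exactly 5% counts as "High") is an assumption.

def classify_change_failure_rate(cfr_pct: float) -> str:
    if cfr_pct < 5:
        return "Elite"
    if cfr_pct <= 10:
        return "High"
    if cfr_pct <= 15:
        return "Medium"
    return "Low"

def classify_deployment_frequency(deploys_per_day: float) -> str:
    if deploys_per_day > 1:          # multiple deploys per day
        return "Elite"
    if deploys_per_day >= 1 / 7:     # between daily and weekly
        return "High"
    if deploys_per_day >= 1 / 30:    # between weekly and monthly
        return "Medium"
    return "Low"                     # less than once a month

# Example values shaped like the calculator's return dict
metrics = {"deployment_frequency_per_day": 0.5, "change_failure_rate_pct": 12.0}
print(classify_deployment_frequency(metrics["deployment_frequency_per_day"]))  # High
print(classify_change_failure_rate(metrics["change_failure_rate_pct"]))        # Medium
```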

Value Stream Mapping

Identify where time is wasted in your delivery process by mapping the full value stream:

[Idea Created]──3 days──[Dev Starts]──5 days──[Code Review]──1 day──[CI/CD]──2h──[Staging]──3 days──[Production]

Total Lead Time: 12+ days
Value-Added Time: ~6 hours (actual coding + pipeline)
Waste: ~11.8 days (waiting, handoffs, approvals)
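
The same numbers can be turned into a flow-efficiency figure, which is often easier to track over time than raw durations. This is a minimal sketch that counts 24-hour calendar days and reuses the step durations from the example stream; the step labels and the `flow_efficiency` helper are illustrative names.

```python
from datetime import timedelta

# Elapsed time per step, taken from the example value stream.
steps = {
    "idea -> dev starts": timedelta(days=3),
    "dev starts -> code review": timedelta(days=5),
    "code review -> CI/CD": timedelta(days=1),
    "CI/CD -> staging": timedelta(hours=2),
    "staging -> production": timedelta(days=3),
}
value_added = timedelta(hours=6)  # actual coding + pipeline, as estimated in the example

def flow_efficiency(total: timedelta, value_added: timedelta) -> float:
    """Share of lead time spent on value-adding work, as a percentage."""
    return value_added / total * 100

total_lead_time = sum(steps.values(), timedelta())
waste = total_lead_time - value_added

print(total_lead_time)  # 12 days, 2:00:00
print(waste)            # 11 days, 20:00:00
print(round(flow_efficiency(total_lead_time, value_added), 1))  # 2.1
```

A flow efficiency of about 2% means that shaving an hour off the build matters far less than shortening the multi-day waits between steps.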

Common waste categories to eliminate:

| Waste | Example | Fix |
|-------|---------|-----|
| Overproduction | Building features nobody uses | OKR alignment, user research |
| Waiting | PR sits for 3 days without review | PR review SLA, async culture |
| Over-processing | 15 manual approval steps | Automate, trust tests |
| Defects | Bug found in production | Shift testing left |
| Transportation | Email → Jira → Slack → meeting | Single source of truth |
| Partially done | Feature branches open for weeks | Trunk-based dev + feature flags |
| Motion | Context switching between 6 projects | WIP limits, team focus |
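
A PR review SLA only helps if breaches are visible. As a rough sketch, the check below flags pull requests whose first review took, or is still taking, longer than an agreed limit. The `PullRequest` shape and the 24-hour default are assumptions for illustration; in practice the data would come from your Git host's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class PullRequest:
    number: int
    opened_at: datetime
    first_review_at: Optional[datetime]  # None while still unreviewed

def sla_breaches(prs: List[PullRequest], now: datetime,
                 sla: timedelta = timedelta(hours=24)) -> List[int]:
    """Return the numbers of PRs whose wait for a first review exceeds the SLA."""
    breaches = []
    for pr in prs:
        # Unreviewed PRs are measured against the current time
        waited = (pr.first_review_at or now) - pr.opened_at
        if waited > sla:
            breaches.append(pr.number)
    return breaches

now = datetime(2026, 5, 10, 12, 0)
prs = [
    PullRequest(101, datetime(2026, 5, 7, 12, 0), None),                        # waiting 3 days
    PullRequest(102, datetime(2026, 5, 9, 9, 0), datetime(2026, 5, 9, 15, 0)),  # reviewed in 6h
    PullRequest(103, datetime(2026, 5, 8, 9, 0), datetime(2026, 5, 9, 18, 0)),  # reviewed in 33h
]
print(sla_breaches(prs, now))  # [101, 103]
```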

Blameless Post-Mortem

When something breaks, learn from it without blame:

## Incident Report: Payment Service Outage — 2026-05-10

**Summary**: Payment service unavailable for 23 minutes, affecting ~4,200 users.

### Timeline
- 14:32 — Deployment of v2.1.0 began
- 14:38 — Deployment complete, health checks passing
- 14:41 — First alert: error rate >5%
- 14:43 — On-call engineer paged
- 14:47 — Root cause identified: new Redis connection pool exhausted
- 14:55 — Rollback initiated
- 15:01 — Service restored, error rate < 0.1%

### Root Cause
New feature added Redis caching but used default pool size (10 connections).
Under production load, pool exhausted causing connection timeouts.

### Why Didn't We Catch This?
- Load test used only 100 concurrent users (production peaks at 2,000)
- Redis connection pool size not monitored as a metric
- No staging environment that matches production scale

### Action Items
| Action | Owner | Due Date |
|--------|-------|---------|
| Add Redis connection pool utilization to dashboards | SRE Team | 2026-05-12 |
| Update load tests to use production-scale traffic | QA Team | 2026-05-18 |
| Add Redis pool exhaustion alert | SRE Team | 2026-05-12 |
| Create staging environment at 25% production scale | Platform Team | 2026-06-01 |

### What Went Well
- Alert fired within 3 minutes of degradation
- Rollback procedure executed without confusion
- On-call runbook was accurate and helpful
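
The timeline in a report like this doubles as data. The sketch below derives time to detect, time to diagnose, and time to restore from the example timestamps, assuming the degradation began when the deployment completed at 14:38; the helper and key names are illustrative.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps copied from the example timeline.
timeline = {
    "deploy_complete": "14:38",
    "first_alert": "14:41",
    "root_cause_identified": "14:47",
    "rollback_initiated": "14:55",
    "service_restored": "15:01",
}

time_to_detect = minutes_between(timeline["deploy_complete"], timeline["first_alert"])
time_to_diagnose = minutes_between(timeline["first_alert"], timeline["root_cause_identified"])
time_to_restore = minutes_between(timeline["deploy_complete"], timeline["service_restored"])

print(time_to_detect, time_to_diagnose, time_to_restore)  # 3 6 23
```

Tracking these three numbers across incidents shows whether action items are shortening detection, diagnosis, or recovery.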

Putting It All Together: The Complete CI/CD Pipeline

Here is a GitHub Actions pipeline that chains the quality, test, build, and deployment stages described above:

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: harbor.mycompany.com
  IMAGE_NAME: payment-service
  PYTHON_VERSION: "3.12"

jobs:
  # ─── Stage 1: Code Quality ──────────────────────────────────────
  code-quality:
    name: Code Quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install dependencies
        run: pip install poetry && poetry install

      - name: Lint and format check
        run: |
          poetry run black --check .
          poetry run ruff check .

      - name: SAST — Semgrep
        uses: semgrep/semgrep-action@v1
        with:
          config: p/python p/owasp-top-ten

      - name: SonarQube scan
        uses: SonarSource/sonarqube-scan-action@v2
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}

  # ─── Stage 2: Unit Tests ────────────────────────────────────────
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: code-quality
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install and test
        run: |
          pip install poetry && poetry install
          poetry run pytest tests/unit \
            --cov=src \
            --cov-report=xml \
            --cov-fail-under=80 \
            -v

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}

  # ─── Stage 3: Integration Tests ─────────────────────────────────
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: unit-tests
    services:
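      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: testpass
          POSTGRES_DB: testdb
        ports:
          - 5432:5432   # map to the runner so localhost:5432 is reachable
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379   # map to the runner so localhost:6379 is reachable
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s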

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Run integration tests
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost/testdb
          REDIS_URL: redis://localhost:6379
        run: |
          pip install poetry && poetry install
          poetry run pytest tests/integration -v

  # ─── Stage 4: Build and Scan Image ──────────────────────────────
  build-image:
    name: Build & Scan Image
    runs-on: ubuntu-latest
    needs: integration-tests
    if: github.event_name == 'push'
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASSWORD }}

      - name: Build image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: false
          load: true
          tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          image: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: cyclonedx-json
          output-file: sbom.json

      - name: Scan image — Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: table
          exit-code: 1
          severity: HIGH,CRITICAL

      - name: Push image
        if: success()
        run: |
          docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}

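      - name: Install Cosign
        uses: sigstore/cosign-installer@v3

      - name: Sign image
        # A step cannot combine `uses` and `run`, so signing is its own step
        run: |
          cosign sign --yes --key env://COSIGN_PRIVATE_KEY \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
        env:
          COSIGN_PRIVATE_KEY: ${{ secrets.COSIGN_PRIVATE_KEY }}
          COSIGN_PASSWORD: ${{ secrets.COSIGN_PASSWORD }}  # assumed secret name; needed if the key is encrypted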

  # ─── Stage 5: Deploy to Staging ─────────────────────────────────
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build-image
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Update image tag in GitOps repo
        run: |
          git clone https://x-token:${{ secrets.GITOPS_TOKEN }}@github.com/myorg/gitops-repo.git
          cd gitops-repo
          # A fresh runner has no git identity, so set one (placeholder values)
          git config user.name "ci-bot"
          git config user.email "ci-bot@mycompany.com"
          yq e '.spec.template.spec.containers[0].image = "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"' \
            -i apps/payment-service/staging/deployment.yaml
          git commit -am "chore: deploy payment-service ${{ github.sha }} to staging"
          git push

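      - name: Wait for ArgoCD sync
        # Assumes the argocd CLI is installed on the runner; the secret names are placeholders
        env:
          ARGOCD_SERVER: ${{ secrets.ARGOCD_SERVER }}
          ARGOCD_AUTH_TOKEN: ${{ secrets.ARGOCD_AUTH_TOKEN }}
        run: |
          argocd app wait payment-service-staging \
            --timeout 300 \
            --health

      - name: Run smoke tests
        run: |
          pip install poetry && poetry install  # this job starts on a clean runner
          poetry run pytest tests/smoke \
            --base-url=https://payment-staging.mycompany.com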

  # ─── Stage 6: Deploy to Production ──────────────────────────────
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Update image tag in GitOps repo (production)
        run: |
          git clone https://x-token:${{ secrets.GITOPS_TOKEN }}@github.com/myorg/gitops-repo.git
          cd gitops-repo
          # A fresh runner has no git identity, so set one (placeholder values)
          git config user.name "ci-bot"
          git config user.email "ci-bot@mycompany.com"
          yq e '.spec.template.spec.containers[0].image = "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"' \
            -i apps/payment-service/production/deployment.yaml
          git commit -am "chore: deploy payment-service ${{ github.sha }} to production"
          git push

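      - name: Create GitHub release
        # semantic-release is an npm package, not a GitHub Action, so run it via npx
        run: npx semantic-release@23
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}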

      - name: Notify team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "✅ payment-service deployed to production",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*payment-service* deployed to production\nCommit: ${{ github.sha }}\nTriggered by: ${{ github.actor }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK  # v1 of the action expects this for incoming webhooks

DevOps Framework at a Glance

┌─────────────────────────────────────────────────────────────────────┐
│                    DevOps Delivery Framework                         │
├─────────┬────────────────────┬──────────────────────────────────────┤
│ Stage   │ Gate (Must Pass)   │ Key Tools                            │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Plan    │ Requirements clear │ Jira, Confluence, Miro               │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Code    │ PR approved,       │ GitHub/GitLab, SonarQube, Semgrep,   │
│         │ SAST clean         │ Snyk, pre-commit                     │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Build   │ Unit tests pass,   │ Maven/Poetry/npm, JFrog, Nexus,      │
│         │ coverage > 80%     │ JUnit, pytest                        │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Package │ Image scan clean,  │ Docker, Trivy, Snyk, Harbor, Cosign, │
│         │ image signed       │ Syft (SBOM), ECR                     │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Test    │ Integration pass,  │ Pytest, Selenium, Playwright, k6,    │
│         │ no HIGH CVEs       │ OWASP ZAP, Pact, WireMock            │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Release │ Version bumped,    │ semantic-release, GitHub Releases,   │
│         │ changelog updated  │ Conventional Commits                 │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Deploy  │ Smoke tests pass,  │ Argo CD, Argo Rollouts, Helm,        │
│         │ health checks OK   │ Kubernetes, Istio                    │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Operate │ SLOs met           │ Prometheus, Grafana, AlertManager,   │
│         │                    │ PagerDuty, Runbooks                  │
├─────────┼────────────────────┼──────────────────────────────────────┤
│ Monitor │ DORA targets met   │ Datadog, Grafana, Loki, Tempo,       │
│         │                    │ Dynatrace, Elastic Stack             │
└─────────┴────────────────────┴──────────────────────────────────────┘
              ↑_______________ Feedback Loop _______________↑
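
The gate column can be read as a list of predicates evaluated in order, where the first failure stops promotion. The sketch below illustrates that idea in plain Python; the fact keys (`coverage_pct`, `high_cves`, and so on) are hypothetical names, and only four of the stages are modeled.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each gate is a predicate over facts collected from a pipeline run.
# Stage names follow the framework table; the fact keys are hypothetical.
Gate = Callable[[Dict[str, float]], bool]

GATES: List[Tuple[str, Gate]] = [
    ("Code", lambda f: f["pr_approved"] == 1 and f["sast_findings"] == 0),
    ("Build", lambda f: f["unit_tests_failed"] == 0 and f["coverage_pct"] > 80),
    ("Package", lambda f: f["high_cves"] == 0 and f["image_signed"] == 1),
    ("Deploy", lambda f: f["smoke_tests_failed"] == 0),
]

def first_failed_gate(facts: Dict[str, float]) -> Optional[str]:
    """Evaluate gates in order; return the first stage that blocks, or None."""
    for stage, gate in GATES:
        if not gate(facts):
            return stage
    return None

facts = {
    "pr_approved": 1, "sast_findings": 0,
    "unit_tests_failed": 0, "coverage_pct": 76.5,
    "high_cves": 0, "image_signed": 1,
    "smoke_tests_failed": 0,
}
print(first_failed_gate(facts))  # Build (coverage is below the 80% gate)
```

Returning the name of the blocking stage, rather than a bare pass/fail, gives the feedback loop something specific to report back to the team.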

Summary

If you are starting from scratch, don't try to implement everything at once. Use this prioritized roadmap:

Month 1 — Foundation

  • Set up version control with PR reviews required
  • Add pre-commit hooks (linting, secret detection)
  • Write basic unit tests (target 50% coverage)
  • Containerize your application
  • Set up a basic CI pipeline (lint → test → build)

Month 2 — Quality and Security

  • Add integration tests
  • Add SAST scanning to CI
  • Add container image scanning
  • Set up artifact registry
  • Implement Conventional Commits + semantic versioning

Month 3 — Automation and Deployment

  • Automate deployment to staging
  • Add smoke tests post-deployment
  • Implement blue-green or canary deployment
  • Set up GitOps with Argo CD

Month 4+ — Observability and Improvement

  • Implement structured logging
  • Add Prometheus metrics (RED method)
  • Set up distributed tracing (OpenTelemetry)
  • Define and monitor SLOs
  • Track and review DORA metrics monthly
  • Run your first chaos engineering experiment

A DevOps framework is never finished — it evolves as your team grows, your product matures, and the technology landscape changes. The goal is not to have a perfect pipeline on day one, but to continuously close the feedback loop between production reality and development decisions. Start small, measure everything, and iterate.
