The DevOps Delivery Pipeline: End-to-End Framework¶
A DevOps framework is not a single tool or process — it is an end-to-end system that connects your product idea to production software, with quality and security built in at every step.
Why You Need a Framework, Not Just Tools¶
Most teams adopt DevOps by collecting tools: "We use Jenkins for CI, Docker for containers, and Kubernetes for deployment." This is tool-first thinking, and it leads to fragmented pipelines where nobody really knows how a commit becomes a production release.
A framework is different. It defines:
- Stages — what happens at each phase of the journey
- Gates — what must be true before moving to the next stage
- Feedback loops — how information flows back from later stages to earlier ones
- Ownership — who is responsible for each stage
Idea → Plan → Code → Build → Test → Release → Deploy → Operate → Monitor
↑__________________feedback____________________________________|
The feedback loop is the most important part. A DevOps framework without feedback is just a waterfall with Docker.
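The four elements above can be sketched in a few lines of Python. This is only an illustration under assumed names: `Stage`, `run_pipeline`, the owners, and the gate thresholds are hypothetical and not part of any tool.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Stage:
    name: str
    owner: str
    # Each gate is a predicate that must hold before the change moves on.
    gates: list[Callable[[dict], bool]] = field(default_factory=list)


def run_pipeline(stages: list[Stage], change: dict) -> tuple[bool, str]:
    """Walk the stages in order and stop at the first failing gate."""
    for stage in stages:
        for gate in stage.gates:
            if not gate(change):
                # Feedback loop: say which stage blocked and who owns it.
                return False, f"{stage.name} blocked (owner: {stage.owner})"
    return True, "released"


stages = [
    Stage("Build", "product-team", [lambda c: c["compiles"]]),
    Stage("Test", "product-team", [lambda c: c["coverage"] >= 0.80]),
    Stage("Release", "platform", [lambda c: c["critical_vulns"] == 0]),
]

ok, message = run_pipeline(
    stages, {"compiles": True, "coverage": 0.72, "critical_vulns": 0}
)
print(ok, message)
```

The point of the sketch is that a blocked change always comes back with a stage name and an owner, which is the minimum a feedback loop needs.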
Stage 1: Team Structure and Product Discovery¶
The Myth of the "DevOps Team"¶
One of the most common mistakes is creating a dedicated DevOps team that acts as a gatekeeper. This recreates the wall of confusion with a new label. Instead, DevOps principles work best with cross-functional product teams that own the full lifecycle of their services.
A healthy product team looks like this:
| Role | Responsibility | Anti-Pattern |
|---|---|---|
| Product Owner | Define what to build and why | Writing tickets without talking to users |
| Designer (UX/UI) | Understand users, design solutions | Designing in a silo, handing off to devs |
| Backend Engineer | Build services, APIs, data layers | "It works on my machine" |
| Frontend Engineer | Build user interfaces | Not caring about API contracts |
| Security Engineer | Embed security from day one | Reviewing only at the end |
| Platform/SRE | Build developer tooling, maintain reliability | Owning Kubernetes, not enabling teams |
| QA Engineer | Verify quality at every stage | Writing tests only after features ship |
The Why–What–How Model¶
Before writing a single line of code, teams should align on three levels:
WHY → Business problem / user need / strategic goal
"We lose 20% of users at checkout because the payment form is too complex"
WHAT → The solution space / product requirement
"A one-click payment flow using saved payment methods"
HOW → Technical implementation
"Add a /payments/quick-checkout endpoint that uses tokenized cards from Stripe"
Skipping the WHY leads to teams that build the wrong thing extremely efficiently.
Discovery Tooling¶
| Purpose | Tools |
|---|---|
| Portfolio / Epic tracking | Jira, Linear, Shortcut |
| Documentation & wikis | Confluence, Notion, Backstage |
| Diagrams & architecture | Miro, draw.io, Excalidraw |
| Security architecture | Threat Dragon, OWASP Threat Modeling |
| API design | Stoplight, Swagger Editor, Redocly |
Stage 2: Source Code Management¶
Everything as Code¶
Modern DevOps treats everything as code — not just application logic, but also infrastructure, tests, security policies, and documentation. This is the foundation of reproducibility.
repository/
├── src/                      # Application code
│   ├── api/
│   ├── services/
│   └── models/
├── tests/                    # All test types
│   ├── unit/
│   ├── integration/
│   ├── e2e/
│   └── performance/
├── infra/                    # Infrastructure as Code
│   ├── terraform/
│   └── helm/
├── .github/workflows/        # CI/CD pipelines
├── .pre-commit-config.yaml
├── Dockerfile
└── docker-compose.yml
Branching Strategies¶
Choose a branching strategy based on your team's release cadence. Three common options are trunk-based development, GitHub Flow, and GitFlow, shown in that order:
main (trunk)
│
├── feature/add-payment (short-lived, < 2 days)
├── feature/user-profile (short-lived, < 2 days)
│
└── tags: v1.2.0, v1.2.1, v1.3.0
- All developers integrate to `main` daily
- Feature flags hide incomplete work
- Best for teams with strong CI and high test coverage
- Reduces merge conflicts dramatically
main
│
├── feature/checkout-redesign
│   └── PR → code review → merge
├── fix/payment-timeout
│   └── PR → code review → merge
- Simple: main is always deployable
- Every change goes through a PR
- Good for web teams with continuous deployment
main ─────────────────────────────────────────
↑ merge at release
release/1.2 ───────────────
↑ cut
develop ─────────────────────────────────────
↑ merge feature branches
feature/* ────────
hotfix/* ──── (cherry-pick to main and develop)
- Useful for products with scheduled releases
- Higher ceremony, more merge conflicts
- Avoid if you can deploy frequently
Code Quality Gates (Pre-Commit)¶
Stop bad code before it hits the repository:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key  # Catch secrets early
      - id: check-added-large-files
        args: ['--maxkb=500']
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.0
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']
  - repo: https://github.com/python-jsonschema/check-jsonschema
    rev: 0.28.0
    hooks:
      - id: check-github-workflows
Static Analysis Tools¶
| Tool | What It Finds | When to Use |
|---|---|---|
| SonarQube | Code smells, bugs, coverage, duplication | CI pipeline, PR gate |
| Snyk | Open-source vulnerabilities (SCA) | CI pipeline, nightly |
| Checkmarx | SAST — security vulnerabilities in code | CI pipeline |
| Semgrep | Custom security rules, fast SAST | Pre-commit, CI |
| Veracode | Enterprise SAST/DAST with compliance reports | Regulated industries |
| Black Duck | License compliance + OSS vulnerabilities | Legal/compliance gate |
Stage 3: The Build Phase¶
What Happens in a Build¶
A build transforms source code into deployable artifacts. It must be:
- Reproducible — same inputs always produce same outputs
- Fast — slow builds break flow
- Isolated — no dependency on developer's local machine
# Example: pyproject.toml with pinned dependencies (reproducible builds)
[tool.poetry]
name = "payment-service"
version = "0.1.0"
[tool.poetry.dependencies]
python = "^3.12"
fastapi = "0.110.0"
sqlalchemy = "2.0.28"
pydantic = "2.6.3"
[tool.poetry.group.dev.dependencies]
pytest = "7.4.4"
pytest-cov = "4.1.0"
black = "24.3.0"
Build Tool Selection¶
| Ecosystem | Build Tool | Package Registry |
|---|---|---|
| Python | pip, Poetry, uv | PyPI, private Nexus/JFrog |
| Java | Maven, Gradle | Maven Central, Artifactory |
| JavaScript/TS | npm, pnpm, yarn | npmjs, Verdaccio |
| Go | go build | pkg.go.dev, GOPROXY |
| .NET | MSBuild, dotnet | NuGet, Artifactory |
Testing During Build: The Testing Pyramid¶
/\
/ \
/ E2E \ ← Few, slow, expensive (Selenium, Playwright)
/──────\
/ Integ \ ← Some (API tests, DB tests, contract tests)
/────────────\
/ Unit Tests \ ← Many, fast, cheap (pytest, JUnit)
/──────────────────\
/ Static Analysis \ ← All code, always (SonarQube, Semgrep)
Unit test example (Python/pytest):
# tests/unit/test_payment_service.py
from decimal import Decimal

import pytest

from services.payment import PaymentService, InsufficientFundsError

# MockGateway is assumed to be a test double defined in the test suite
# (for example in conftest.py); it is not shown here.


class TestPaymentService:
    def test_process_payment_success(self):
        service = PaymentService(gateway=MockGateway())
        result = service.process(amount=Decimal("99.99"), currency="USD")
        assert result.status == "approved"
        assert result.transaction_id is not None

    def test_process_payment_insufficient_funds(self):
        gateway = MockGateway(decline_reason="insufficient_funds")
        service = PaymentService(gateway=gateway)
        with pytest.raises(InsufficientFundsError):
            service.process(amount=Decimal("999.99"), currency="USD")

    def test_process_payment_invalid_amount(self):
        service = PaymentService(gateway=MockGateway())
        with pytest.raises(ValueError, match="Amount must be positive"):
            service.process(amount=Decimal("-10.00"), currency="USD")
Artifact Storage Strategy¶
Never rebuild the same code twice. Store built artifacts in a registry:
Build → Artifact Registry (Nexus / JFrog Artifactory / GitHub Packages)
│
├── Python wheels (.whl)
├── Java JARs
├── npm packages
└── Container images (see Stage 4)
Why artifact registries matter:
- Build once, deploy many times — same binary to dev/staging/prod
- Rollbacks are instant — the old artifact is still there
- Security scanning happens once, not per environment
- License compliance is tracked centrally
Stage 4: Containerization and Image Security¶
The Container Build Pipeline¶
Building a container image is more than running docker build. A production-grade image pipeline looks like this:
Source Code
│
▼
Dockerfile / Buildpacks
│
▼
Build → OCI Image
│
▼
SBOM Generation (CycloneDX / SPDX)
│
▼
SCA Scan (Trivy / Snyk / Black Duck)
│
├── PASS → Push to Registry
└── FAIL → Block pipeline, notify team
│
▼
Container Registry (Harbor / ECR / Docker Hub / JFrog)
│
▼
Image Signing (Cosign / Notary)
Writing Secure Dockerfiles¶
# Stage 1: Build dependencies
FROM python:3.12.3-slim AS builder
WORKDIR /build
COPY pyproject.toml poetry.lock ./
RUN pip install --no-cache-dir poetry==1.8.2 && \
poetry export -f requirements.txt --output requirements.txt --without-hashes
# Stage 2: Runtime image
FROM python:3.12.3-slim AS runtime
# Create non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
# Install only runtime dependencies
COPY --from=builder /build/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && \
rm -rf /root/.cache
COPY --chown=appuser:appuser src/ ./src/
USER appuser
EXPOSE 8080
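# python:*-slim images do not ship curl, so probe with the Python standard library
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)" || exit 1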
CMD ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]
SBOM: Software Bill of Materials¶
An SBOM is an ingredient list for your software. It tells you exactly what open-source components are inside your container image — critical for vulnerability response and license compliance.
# Generate SBOM with Syft
syft payment-service:v1.2.0 -o cyclonedx-json > sbom.json
# Scan SBOM for vulnerabilities with Grype
grype sbom:./sbom.json --fail-on high
# Scan with Trivy
trivy image --severity HIGH,CRITICAL payment-service:v1.2.0
# Sign the image after scanning passes
cosign sign --key cosign.key payment-service:v1.2.0
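Once an SBOM exists, answering "are we affected?" is a JSON lookup. The sketch below assumes the CycloneDX JSON layout with a top-level `components` array; the sample document is hand-written for illustration and is not real Syft output, and the helper names are hypothetical.

```python
import json

# Hand-written sample in CycloneDX JSON shape (not produced by a real scan).
SAMPLE_SBOM = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "requests", "version": "2.31.0",
     "purl": "pkg:pypi/requests@2.31.0"},
    {"type": "library", "name": "cryptography", "version": "42.0.0",
     "purl": "pkg:pypi/cryptography@42.0.0"},
    {"type": "library", "name": "libssl3", "version": "3.0.11-1",
     "purl": "pkg:deb/debian/libssl3@3.0.11-1"}
  ]
}
""")


def list_components(sbom: dict) -> list[tuple[str, str]]:
    """Return (name, version) pairs, sorted by name."""
    return sorted(
        (c["name"], c.get("version", "unknown"))
        for c in sbom.get("components", [])
    )


def is_affected(sbom: dict, name: str, vulnerable_versions: set[str]) -> bool:
    """True if the SBOM lists `name` at one of the vulnerable versions."""
    return any(
        c["name"] == name and c.get("version") in vulnerable_versions
        for c in sbom.get("components", [])
    )


for name, version in list_components(SAMPLE_SBOM):
    print(f"{name}=={version}")
```

In practice the same query would run over the SBOMs of every deployed image, which is what makes a stored SBOM useful during a vulnerability response.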
Sample Trivy output interpretation:
payment-service:v1.2.0 (debian 12.5)
Total: 3 (HIGH: 2, CRITICAL: 1)
┌──────────────┬────────────────┬──────────┬───────────────────┬──────────────────────┐
│ Library │ Vulnerability │ Severity │ Installed Version │ Fixed Version │
├──────────────┼────────────────┼──────────┼───────────────────┼──────────────────────┤
│ libssl3 │ CVE-2024-XXXX │ CRITICAL │ 3.0.11-1 │ 3.0.13-1 │
│ cryptography │ CVE-2024-YYYY │ HIGH │ 42.0.0 │ 42.0.5 │
│ requests │ CVE-2024-ZZZZ │ HIGH │ 2.31.0 │ 2.32.0 │
└──────────────┴────────────────┴──────────┴───────────────────┴──────────────────────┘
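The table above is what a pipeline gate has to turn into a pass or fail decision. The following sketch assumes Trivy's JSON report shape (`Results[].Vulnerabilities[]` with `Severity`, `PkgName`, and `FixedVersion` keys); the report literal mirrors the sample table and is not real scanner output, and the function names are hypothetical.

```python
from collections import Counter

BLOCKING = {"HIGH", "CRITICAL"}

# Hand-written report mirroring the sample table (placeholder CVE IDs).
report = {
    "Results": [
        {
            "Target": "payment-service:v1.2.0 (debian 12.5)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-2024-XXXX", "PkgName": "libssl3",
                 "Severity": "CRITICAL", "FixedVersion": "3.0.13-1"},
                {"VulnerabilityID": "CVE-2024-YYYY", "PkgName": "cryptography",
                 "Severity": "HIGH", "FixedVersion": "42.0.5"},
                {"VulnerabilityID": "CVE-2024-ZZZZ", "PkgName": "requests",
                 "Severity": "HIGH", "FixedVersion": "2.32.0"},
            ],
        }
    ]
}


def summarize(scan: dict) -> Counter:
    """Count findings per severity across all scan targets."""
    counts: Counter = Counter()
    for result in scan.get("Results", []):
        # The key may be absent or null for targets with no findings.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln["Severity"]] += 1
    return counts


def gate_passes(scan: dict) -> bool:
    """Pass only when no blocking-severity finding is present."""
    counts = summarize(scan)
    return not any(counts[severity] for severity in BLOCKING)


def upgrade_plan(scan: dict) -> dict[str, str]:
    """Map each affected package to the version that fixes it, when known."""
    return {
        v["PkgName"]: v["FixedVersion"]
        for r in scan.get("Results", [])
        for v in r.get("Vulnerabilities") or []
        if v.get("FixedVersion")
    }
```

Reading the result this way separates two questions: whether the pipeline should block (`gate_passes`) and what to change to unblock it (`upgrade_plan`).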
Container Registry Strategy¶
| Registry | Best For | Key Feature |
|---|---|---|
| Harbor | Self-hosted, air-gapped | RBAC, built-in scanning, replication |
| Amazon ECR | AWS-native teams | IAM integration, lifecycle policies |
| Google Artifact Registry | GCP-native teams | Multi-format (Docker, Maven, npm) |
| Docker Hub | Open source / small teams | Largest public registry |
| JFrog Container Registry | Enterprise, multi-cloud | Universal, advanced security |
Stage 5: Release Management¶
Semantic Versioning¶
Every release must have a clear, machine-readable version number:
MAJOR.MINOR.PATCH[-PRERELEASE][+BUILD]
v1.2.3
│ │ └── PATCH: backward-compatible bug fixes
│ └──── MINOR: backward-compatible new features
└────── MAJOR: breaking changes
Examples:
- v1.0.0 → Initial stable release
- v1.1.0 → New feature added (backward compatible)
- v1.1.1 → Bug fix
- v2.0.0 → Breaking API change
- v1.2.0-rc.1 → Release candidate
Automated Release with Conventional Commits¶
Use Conventional Commits to automate version bumping and changelog generation:
feat: add one-click payment support → bumps MINOR (1.2.0 → 1.3.0)
fix: correct tax calculation rounding → bumps PATCH (1.3.0 → 1.3.1)
feat!: redesign checkout API → bumps MAJOR (1.3.1 → 2.0.0)
docs: update API reference → no version bump
chore: upgrade dependencies → no version bump
# .releaserc.yml: semantic-release configuration
branches:
  - main
  - name: beta
    prerelease: true
plugins:
  - "@semantic-release/commit-analyzer"
  - "@semantic-release/release-notes-generator"
  - "@semantic-release/changelog"
  - "@semantic-release/git"
  - "@semantic-release/github"
verifyConditions:
  - "@semantic-release/github"
prepare:
  - "@semantic-release/changelog"
  - "@semantic-release/git"
publish:
  - "@semantic-release/github"
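To make the commit-type rules from this section concrete, here is a minimal sketch of how a release tool might derive the next version from commit messages. It is not how semantic-release is implemented; `bump_for` and `next_version` are hypothetical helpers, and prerelease handling is deliberately left out.

```python
import re
from typing import Optional

# type, optional (scope), optional "!" for breaking, then ": description"
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<breaking>!)?: .+")
RANK = {"patch": 1, "minor": 2, "major": 3}


def bump_for(message: str) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None for a single commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return None
    if match["breaking"] or "BREAKING CHANGE:" in message:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return None  # docs, chore, and similar types do not trigger a release


def next_version(current: str, messages: list[str]) -> str:
    """Apply the highest-ranked bump found in `messages` to `current`."""
    major, minor, patch = (int(part) for part in current.lstrip("v").split("."))
    bumps = [b for b in map(bump_for, messages) if b]
    if not bumps:
        return current
    top = max(bumps, key=RANK.__getitem__)
    if top == "major":
        return f"v{major + 1}.0.0"
    if top == "minor":
        return f"v{major}.{minor + 1}.0"
    return f"v{major}.{minor}.{patch + 1}"


print(next_version("v1.2.0", ["feat: add one-click payment support"]))
```

Only the highest-ranked bump in a batch matters: one breaking change among ten fixes still produces a single major release.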
Release Tracking¶
Every release should be documented with:
## Release v1.3.0 — 2026-05-18
### Changes
- feat: one-click payment with saved cards (#234)
- feat: support Apple Pay and Google Pay (#241)
- fix: correct tax calculation for EU customers (#238)
### Security
- Updated cryptography from 42.0.0 to 42.0.5 (CVE-2024-YYYY)
### Breaking Changes
None
### Deployment Notes
- Run migration: `alembic upgrade head`
- Set new env var: `PAYMENT_TOKENIZATION_KEY`
### Metrics Baseline (before deployment)
- Error rate: 0.02%
- p99 latency: 180ms
- Deployment frequency: 3x/week
Stage 6: Comprehensive Testing¶
The Testing Spectrum¶
Testing isn't just about unit tests. A mature testing strategy covers the full spectrum from fast developer-local tests to slow production validation:
← Faster / Cheaper / More Isolated Slower / More Real →
Unit → Integration → Contract → E2E → Performance → Chaos → Production
Test Types Deep Dive¶
Unit tests. Test individual functions and classes in isolation. Mock all external dependencies.
# Fast, no I/O, no network
def test_order_total_with_discount():
    order = Order(
        items=[
            OrderItem(product_id="P1", quantity=2, unit_price=Decimal("50.00")),
            OrderItem(product_id="P2", quantity=1, unit_price=Decimal("30.00")),
        ]
    )
    coupon = Coupon(code="SAVE20", discount_percent=20)
    total = order.calculate_total(coupon=coupon)
    assert total == Decimal("104.00")  # (100 + 30) * 0.8
Target: > 80% code coverage. Run time: < 60 seconds for entire suite.
Integration tests. Test how components work together. Use real databases (in Docker) and real message queues.
# tests/integration/test_order_repository.py
import pytest
from sqlalchemy import create_engine

from repositories.order_repository import SQLOrderRepository

# Order and run_migrations are assumed to come from the application code
# and its migration helpers; their imports are not shown here.


@pytest.fixture(scope="session")
def test_db():
    engine = create_engine("postgresql://test:test@localhost:5433/testdb")
    # Run migrations
    run_migrations(engine)
    yield engine
    engine.dispose()


def test_save_and_retrieve_order(test_db):
    repo = SQLOrderRepository(test_db)
    order = Order.create(customer_id="C001", items=[...])
    repo.save(order)
    retrieved = repo.find_by_id(order.id)
    assert retrieved.id == order.id
    assert len(retrieved.items) == len(order.items)
Contract tests. Verify that a consumer and provider agree on an API contract, without requiring both to be live at the same time.
# Consumer side (using pact-python)
from pact import Consumer, EachLike, Provider

pact = Consumer("payment-service").has_pact_with(
    Provider("user-service"),
    host_name="localhost",
    port=1234,
)
# The mock service must be running (pact.start_service()) before the test executes.


def test_get_user_payment_methods():
    (pact
        .given("User U001 has two saved payment methods")
        .upon_receiving("a request for payment methods")
        .with_request("GET", "/users/U001/payment-methods")
        .will_respond_with(200, body={
            "user_id": "U001",
            # EachLike matches a list whose items all share this shape
            "methods": EachLike({
                "id": "PM001",
                "type": "credit_card",
                "last_four": "4242",
            }),
        }))
    with pact:
        # get_user_payment_methods is the consumer's client function under test
        result = get_user_payment_methods("U001")
    assert len(result) >= 1
End-to-end tests. Test the full user journey through the real UI or API endpoints.
# Using Playwright (Python)
from playwright.sync_api import Page


def test_complete_checkout_flow(page: Page):
    # Navigate to product
    page.goto("https://staging.myapp.com/products/laptop-pro")
    page.click("[data-testid='add-to-cart']")
    # Go to checkout
    page.click("[data-testid='checkout-button']")
    page.fill("[name='email']", "test@example.com")
    # Payment
    page.fill("[name='card-number']", "4242 4242 4242 4242")
    page.fill("[name='expiry']", "12/28")
    page.fill("[name='cvv']", "123")
    page.click("[data-testid='place-order']")
    # Verify confirmation
    page.wait_for_selector("[data-testid='order-confirmation']")
    assert "Order confirmed" in page.inner_text("h1")
Performance tests. Verify the system behaves correctly under load.
# locustfile.py
from locust import HttpUser, task, between


class CheckoutUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def browse_products(self):
        self.client.get("/api/products?category=electronics")

    @task(2)
    def view_product(self):
        self.client.get("/api/products/laptop-pro")

    @task(1)
    def add_to_cart(self):
        self.client.post("/api/cart/items", json={
            "product_id": "laptop-pro",
            "quantity": 1,
        })
Run: locust -f locustfile.py --host https://staging.myapp.com --headless --users 1000 --spawn-rate 100 --run-time 5m
Performance SLOs to test against:
- p50 response time < 100ms
- p99 response time < 500ms
- Error rate < 0.1% at peak load
- Throughput > 1,000 RPS
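The SLO thresholds above can be checked mechanically against load-test results. The sketch below uses a nearest-rank percentile over raw latency samples; the function names and the sample numbers are made up for illustration and do not come from a real test run.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Ceiling division in integer arithmetic avoids float rounding surprises.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


def check_slos(latencies_ms: list[float], errors: int, total: int,
               duration_s: float) -> dict[str, bool]:
    """Evaluate each SLO from the list above and report pass/fail per SLO."""
    return {
        "p50 < 100ms": percentile(latencies_ms, 50) < 100,
        "p99 < 500ms": percentile(latencies_ms, 99) < 500,
        "error rate < 0.1%": errors / total < 0.001,
        "throughput > 1000 RPS": total / duration_s > 1000,
    }


# Made-up sample: 100 latency measurements from a 5-minute run.
latencies = [40.0] * 60 + [90.0] * 35 + [450.0] * 4 + [900.0]
results = check_slos(latencies, errors=200, total=360_000, duration_s=300)
for slo, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {slo}")
```

Returning one boolean per SLO, rather than a single verdict, keeps the pipeline output actionable when only one threshold is missed.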
Security Testing: DAST and IAST¶
DAST (Dynamic Application Security Testing) attacks your running application from the outside:
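# OWASP ZAP baseline (passive) scan against staging
# The working directory is mounted so the HTML report is written to the host
docker run --rm -v "$(pwd):/zap/wrk/:rw" -t ghcr.io/zaproxy/zaproxy:stable zap-baseline.py \
    -t https://staging.myapp.com \
    -r zap-report.html
# zap-baseline.py exits non-zero when it reports FAIL or WARN alerts,
# which fails the pipeline step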
IAST (Interactive Application Security Testing) instruments your application from within to detect vulnerabilities as tests run:
- Contrast Security
- Seeker by Synopsys
- HCL AppScan IAST
QA Tool Landscape¶
| Category | Tools | Use Case |
|---|---|---|
| Browser automation | Selenium, Playwright, Cypress | E2E web testing |
| Mobile testing | Appium, Detox | iOS/Android automation |
| Visual regression | Applitools, Percy | Catch UI regressions |
| API testing | Postman, REST Assured, httpx | API contract and functional |
| Performance | JMeter, Locust, k6, BlazeMeter | Load and stress testing |
| Service virtualization | WireMock, Mockoon | Mock external dependencies |
| Cross-browser cloud | Sauce Labs, BrowserStack | Multi-browser/device testing |
Stage 7: Deployment Strategies¶
Choosing the Right Deployment Strategy¶
Not every change needs the same deployment strategy. Match the strategy to the risk:
| Strategy | Risk | Speed | Rollback | When to Use |
|---|---|---|---|---|
| Recreate | High | Fast | Slow | Dev environments only |
| Rolling Update | Medium | Medium | Medium | Default for low-traffic services |
| Blue-Green | Low | Fast | Instant | Critical services, DB migrations |
| Canary | Very Low | Slow | Instant | New features, uncertain impact |
| A/B Testing | Very Low | Slowest | Instant | Feature experiments |
| Shadow | None | Slowest | N/A | Testing ML models in production |
Blue-Green Deployment¶
Two identical environments. Traffic switches instantly. Zero downtime.
┌──────────────────┐
│ Load Balancer │
└────────┬─────────┘
│ 100% traffic
┌──────▼──────┐
│ Blue (v1) │ ← Currently live
└─────────────┘
[Deploy v2 to Green environment]
┌──────────────────┐
│ Load Balancer │
└────────┬─────────┘
│ 100% traffic
┌──────▼──────┐
│ Green (v2) │ ← Switch over (instant)
└─────────────┘
┌─────────────┐
│ Blue (v1) │ ← Keep alive for rollback
└─────────────┘
Kubernetes Blue-Green with Argo Rollouts:
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  strategy:
    blueGreen:
      activeService: payment-service-active
      previewService: payment-service-preview
      autoPromotionEnabled: false  # Manual promotion required
      scaleDownDelaySeconds: 600   # Keep blue alive 10 min
      prePromotionAnalysis:
        templates:
          - templateName: error-rate-check
        args:
          - name: service-name
            value: payment-service-preview
Canary Deployment¶
Route a small percentage of traffic to the new version, gradually increase if metrics are healthy.
v1 (stable) ──────── 95% ──────────────→ Users
v2 (canary) ──────── 5% ──────────────→ Users
[Monitor: error rate, latency, business metrics]
v1 (stable) ──────── 70% ──────────────→ Users
v2 (canary) ──────── 30% ──────────────→ Users
[All good → promote]
v2 (stable) ──────── 100% ─────────────→ Users
# Kubernetes Canary with Argo Rollouts
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5             # 5% traffic to canary
        - pause: {duration: 5m}    # Wait 5 minutes
        - analysis:                # Automated analysis
            templates:
              - templateName: canary-analysis
        - setWeight: 20            # 20% if analysis passes
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100           # Full promotion
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vsvc
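The analysis step is where the "if metrics are healthy" decision gets made. The sketch below shows one plausible decision rule comparing canary and stable error rates; it is not how Argo Rollouts analysis templates work internally, and the `Window` type, thresholds, and verdict strings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts observed during one analysis window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(stable: Window, canary: Window, *,
                   max_ratio: float = 1.5,
                   floor: float = 0.001,
                   min_requests: int = 500) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current step."""
    if canary.requests < min_requests:
        return "wait"  # too little traffic to judge either way
    # Tolerate up to max_ratio times the stable error rate, but never
    # less than `floor`, so a near-zero baseline does not block every release.
    allowed = max(stable.error_rate * max_ratio, floor)
    return "rollback" if canary.error_rate > allowed else "promote"


stable = Window(requests=95_000, errors=19)
print(canary_verdict(stable, Window(requests=5_000, errors=2)))
print(canary_verdict(stable, Window(requests=5_000, errors=40)))
```

Comparing against the stable version rather than a fixed threshold means a noisy dependency that affects both versions equally does not trigger a rollback.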
Feature Flags: Deployment ≠ Release¶
Separate the act of deploying code from the act of releasing a feature to users:
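# Using feature flags (LaunchDarkly shown; Unleash and Flagsmith work similarly)
# Context API as provided by the LaunchDarkly Python SDK 8+
import ldclient
from ldclient import Context
from ldclient.config import Config

# Initialise the SDK once at startup, not on every request
ldclient.set_config(Config(settings.LAUNCH_DARKLY_KEY))
ld_client = ldclient.get()


def checkout(user_id: str, cart: Cart) -> CheckoutResult:
    context = Context.builder(user_id).set("email", get_user_email(user_id)).build()
    if ld_client.variation("one-click-checkout", context, False):
        # New one-click flow, only shown to flagged users
        return one_click_checkout(cart)
    else:
        # Original flow for everyone else
        return standard_checkout(cart)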
This lets you:
- Deploy to production without activating the feature
- Enable for internal users first (dogfooding)
- Gradually roll out to 1% → 10% → 100% of users
- Flip a kill switch instantly if something goes wrong
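A gradual rollout only works if each user gets a stable answer as the percentage grows. One common way to achieve that is deterministic hash bucketing, sketched below. This is a generic illustration, not the algorithm of LaunchDarkly, Unleash, or Flagsmith; `bucket`, `is_enabled`, and the allowlist parameter are hypothetical names.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag_key: str, user_id: str, rollout_percent: int,
               allowlist: frozenset[str] = frozenset()) -> bool:
    """Allowlisted users (dogfooding) always get the feature; others by bucket."""
    if user_id in allowlist:
        return True
    return bucket(flag_key, user_id) < rollout_percent


users = [f"user-{n}" for n in range(10_000)]
at_10 = {u for u in users if is_enabled("one-click-checkout", u, 10)}
at_50 = {u for u in users if is_enabled("one-click-checkout", u, 50)}
print(len(at_10), len(at_50))
```

Because the bucket depends only on the flag key and user ID, raising the percentage adds users without removing anyone, and setting it to 0 acts as the kill switch.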
Stage 8: Post-Deployment Verification¶
Smoke Tests in Production¶
Run a minimal set of critical-path tests immediately after deployment to verify the service is up:
# tests/smoke/test_payment_smoke.py
import os

import httpx
import pytest

BASE_URL = os.getenv("TARGET_URL", "https://api.myapp.com")


def test_health_check():
    response = httpx.get(f"{BASE_URL}/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"


def test_payment_endpoint_responds():
    response = httpx.get(f"{BASE_URL}/api/v1/payments/methods")
    assert response.status_code in [200, 401]  # OK or needs auth, not 500


def test_critical_dependency_connectivity():
    response = httpx.get(f"{BASE_URL}/health/deep")
    health = response.json()
    assert health["database"] == "connected"
    assert health["cache"] == "connected"
    # Payment gateway may be external, check degraded mode
    assert health["payment_gateway"] in ["connected", "degraded"]
Infrastructure Vulnerability Scanning¶
After deployment, scan the infrastructure itself:
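# Tenable Nessus: scan deployed infrastructure
# NOTE: illustrative pseudo-command. Nessus scans are normally launched from
# its web UI or REST API, so treat the flags below as placeholders.
nessus --target payment-service.prod.internal \
    --policy "PCI DSS Compliance" \
    --output-format pdf \
    --output-file infra-scan-report.pdf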
# Falco — real-time container runtime security
# Detects unusual activity in running containers
cat /etc/falco/falco_rules.yaml
# Falco rule: detect container trying to write to /etc
- rule: Write to /etc in container
  desc: Attempt to write to /etc directory in a container
  condition: >
    container and
    open_write and
    fd.name startswith /etc
  output: "File opened for writing under /etc (%user.name %proc.name %fd.name)"
  priority: ERROR
Chaos Engineering¶
Proactively break things in production to find weaknesses before they find you:
# chaos-monkey.py: simplified chaos test
import random

from kubernetes import client, config


def random_pod_kill(namespace: str, label_selector: str):
    """Kill a random pod to test resilience."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace=namespace,
        label_selector=label_selector,
    )
    if not pods.items:
        raise ValueError(f"No pods found matching {label_selector}")
    target = random.choice(pods.items)
    print(f"Killing pod: {target.metadata.name}")
    v1.delete_namespaced_pod(
        name=target.metadata.name,
        namespace=namespace,
    )
    # The pod's controller (for example a Deployment) should replace it automatically
Chaos Engineering maturity levels:
- Level 1: Terminate a random pod — does Kubernetes restart it?
- Level 2: Kill an entire availability zone — does traffic failover?
- Level 3: Introduce latency on a critical dependency — does the circuit breaker trip?
- Level 4: Saturate CPU/memory on a node — does the HPA scale out?
- Level 5: Simulate a database failover — does the application recover?
Stage 9: Monitoring and Observability¶
The Three Pillars of Observability¶
Logging, metrics, and traces are distinct but complementary:
LOGS → What happened? (events, errors, audit trail)
METRICS → How is the system performing? (numbers over time)
TRACES → Why is this request slow? (distributed request path)
Metrics: The RED Method¶
For every service, instrument these three metric types:
| Metric | Prometheus Query | Alert Condition |
|---|---|---|
| Rate (RPS) | rate(http_requests_total[5m]) | < 50% of baseline |
| Errors | sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | > 1% |
| Duration | histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) | > 500ms |
# Instrumenting a FastAPI service
import time

from fastapi import FastAPI, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_code"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
)

app = FastAPI()


@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    response = await call_next(request)
    duration = time.perf_counter() - start
    # Raw URL paths can inflate label cardinality; prefer the route template in production
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status_code=response.status_code,
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.url.path,
    ).observe(duration)
    return response


@app.get("/metrics")
def metrics():
    # Serve the Prometheus text format instead of letting FastAPI JSON-encode the bytes
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
Structured Logging¶
Unstructured logs are noise. Structured logs are searchable, filterable, and queryable:
import structlog
log = structlog.get_logger()
log.info(
"payment.processing",
user_id=user_id,
amount=str(amount),
currency=currency,
payment_method_id=payment_method_id,
trace_id=get_trace_id()
)
log.error(
"payment.failed",
user_id=user_id,
error_code=error.code,
error_message=str(error),
payment_method_id=payment_method_id,
trace_id=get_trace_id()
)
Output (JSON):
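The rendered output is not reproduced here. As a stand-in, the stdlib-only sketch below shows the general shape one JSON log line takes when an event name and its key-value fields are serialized together. It does not use structlog; `render_event`, the `level` and `timestamp` keys, and every sample value are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def render_event(event: str, level: str, **fields: object) -> str:
    """Serialize one log event as a single JSON line."""
    record = {
        "event": event,
        "level": level,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    return json.dumps(record)


line = render_event(
    "payment.failed",
    "error",
    user_id="U001",                      # hypothetical sample values
    error_code="card_declined",
    payment_method_id="PM001",
    trace_id="0123456789abcdef0123456789abcdef",
)
print(line)
```

Because every field is a named key, a log backend can filter on `error_code` or join on `trace_id` without parsing free text.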
Distributed Tracing with OpenTelemetry¶
Traces let you follow a single request across multiple services:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payment-service")


# Instrument code
async def process_payment(payment_request: PaymentRequest):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("user.id", payment_request.user_id)
        span.set_attribute("payment.amount", float(payment_request.amount))
        span.set_attribute("payment.currency", payment_request.currency)
        try:
            # Downstream calls produce child spans when their clients are instrumented
            user = await user_service.get_user(payment_request.user_id)
            result = await payment_gateway.charge(payment_request)
            span.set_attribute("payment.transaction_id", result.transaction_id)
            span.set_status(trace.StatusCode.OK)
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise
SLO-Based Alerting¶
Don't page on every internal cause or resource blip. Alert on user impact, measured against an SLO:
# SLO definition (YAML for Pyrra / Sloth)
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: payment-service-availability
spec:
  target: "99.9"  # 99.9% success rate
  window: 30d     # Measured over 30 days
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="payment-service",status_code=~"5.."}
      total:
        metric: http_requests_total{job="payment-service"}
  # Alert when we're burning error budget too fast.
  # The thresholds below are illustrative; check your tool's schema for exact field names.
  alerting:
    burnRateAlerts:
      - short: 5m
        long: 1h
        burnRate: 14.4  # 1h burn: page immediately
        severity: critical
      - short: 30m
        long: 6h
        burnRate: 6     # 6h burn: warn the team
        severity: warning
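The burn-rate numbers follow from simple arithmetic. A burn rate is the observed error ratio divided by the error budget, and a burn rate sustained over an alert window consumes a proportional share of the 30-day budget. The helper names in this sketch are hypothetical.

```python
import math


def error_budget(target_pct: float) -> float:
    """Fraction of requests allowed to fail, e.g. 99.9 gives about 0.001."""
    return 1 - target_pct / 100


def burn_rate(observed_error_ratio: float, target_pct: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / error_budget(target_pct)


def budget_consumed(burn: float, alert_window_hours: float,
                    slo_window_days: int = 30) -> float:
    """Share of the whole SLO window's budget used up during the alert window."""
    return burn * alert_window_hours / (slo_window_days * 24)


# A 1.44% error ratio against a 99.9% target burns budget about 14.4x too fast.
print(round(burn_rate(0.0144, 99.9), 2))
# Sustained for 1 hour, that uses about 2% of the 30-day budget.
print(round(budget_consumed(14.4, 1) * 100, 2))
# A burn rate of 6 sustained for 6 hours uses about 5%.
print(round(budget_consumed(6, 6) * 100, 2))
```

This is why the fast alert pages and the slow one only warns: both fire after a small, bounded share of the budget is gone, but the first indicates the budget would be exhausted within roughly two days.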
Observability Platform Selection¶
| Platform | Best For | Key Strength |
|---|---|---|
| Grafana + Prometheus + Loki + Tempo | Self-hosted, cost-conscious | Full LGTM stack, open source |
| Datadog | Enterprise, multi-cloud | Best all-in-one experience |
| Dynatrace | Large enterprises, AI-ops | Auto-instrumentation, Davis AI |
| New Relic | Full-stack observability | Generous free tier |
| Elastic Stack (ELK/EFK) | Log-heavy workloads | Powerful search and analytics |
| Splunk | Security + ops combined | SIEM capabilities built in |
| AppDynamics | Java/.NET enterprise | Deep APM, business metrics |
Stage 10: Continuous Improvement and Feedback Loops¶
DORA Metrics: The North Star¶
Google's DORA research identified four metrics that predict high-performing engineering organizations. The tier boundaries below are indicative, since the published bands shift between annual reports:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | Multiple/day | Daily–weekly | Weekly–monthly | < Monthly |
| Lead Time for Changes | < 1 hour | 1 day – 1 week | 1 week – 1 month | > 1 month |
| Change Failure Rate | < 5% | 5–10% | 10–15% | > 15% |
| MTTR | < 1 hour | < 1 day | < 1 week | > 1 month |
Measuring DORA in practice:
# Simple DORA metric calculator
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    deployed_at: datetime
    lead_time_hours: float  # commit to production
    failed: bool


@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime
    caused_by_deployment: bool


def calculate_dora_metrics(
    deployments: List[Deployment],
    incidents: List[Incident],
    window_days: int = 30,
) -> dict:
    cutoff = datetime.now() - timedelta(days=window_days)
    recent_deploys = [d for d in deployments if d.deployed_at > cutoff]
    recent_incidents = [i for i in incidents if i.started_at > cutoff]

    # Deployment Frequency (deploys per day)
    deploy_freq = len(recent_deploys) / window_days

    # Lead Time and Change Failure Rate are undefined without deployments,
    # so report 0 instead of dividing by zero
    if recent_deploys:
        avg_lead_time = sum(d.lead_time_hours for d in recent_deploys) / len(recent_deploys)
        failed = [d for d in recent_deploys if d.failed]
        cfr = len(failed) / len(recent_deploys) * 100
    else:
        avg_lead_time = 0.0
        cfr = 0.0

    # MTTR (hours), counting only incidents caused by a deployment
    deployment_incidents = [i for i in recent_incidents if i.caused_by_deployment]
    if deployment_incidents:
        avg_mttr = sum(
            (i.resolved_at - i.started_at).total_seconds() / 3600
            for i in deployment_incidents
        ) / len(deployment_incidents)
    else:
        avg_mttr = 0.0

    return {
        "deployment_frequency_per_day": round(deploy_freq, 2),
        "avg_lead_time_hours": round(avg_lead_time, 1),
        "change_failure_rate_pct": round(cfr, 1),
        "mean_time_to_recover_hours": round(avg_mttr, 1),
    }
Value Stream Mapping¶
Identify where time is wasted in your delivery process by mapping the full value stream:
[Idea Created]──3 days──[Dev Starts]──5 days──[Code Review]──1 day──[CI/CD]──2h──[Staging]──3 days──[Production]
Total Lead Time: 12+ days
Value-Added Time: ~6 hours (actual coding + pipeline)
Waste: ~11.5 days (waiting, handoffs, approvals)
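The totals above can be recomputed from the individual steps, which also yields the flow efficiency (value-added time divided by total lead time). The step labels and the `HOURS_PER_DAY` convention of 24-hour calendar days are assumptions of this sketch; the durations and the 6 value-added hours come from the map above.

```python
HOURS_PER_DAY = 24  # calendar days, an assumption of this sketch

# (step label, elapsed hours) taken from the value stream map
steps = [
    ("Idea created → dev starts", 3 * HOURS_PER_DAY),
    ("Dev starts → code review", 5 * HOURS_PER_DAY),
    ("Code review → CI/CD", 1 * HOURS_PER_DAY),
    ("CI/CD → staging", 2),
    ("Staging → production", 3 * HOURS_PER_DAY),
]
value_added_hours = 6  # actual coding plus pipeline time

total_hours = sum(hours for _, hours in steps)
waste_hours = total_hours - value_added_hours
flow_efficiency = value_added_hours / total_hours

print(f"Total lead time: {total_hours / HOURS_PER_DAY:.1f} days")
print(f"Waste:           {waste_hours / HOURS_PER_DAY:.1f} days")
print(f"Flow efficiency: {flow_efficiency:.1%}")

# The longest wait is usually the best place to start improving.
slowest_step = max(steps, key=lambda step: step[1])[0]
print(f"Largest step:    {slowest_step}")
```

A flow efficiency of a few percent is typical for a first mapping exercise, and it shifts the conversation from "code faster" to "wait less".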
Common waste categories to eliminate:
| Waste | Example | Fix |
|---|---|---|
| Overproduction | Building features nobody uses | OKR alignment, user research |
| Waiting | PR sits for 3 days without review | PR review SLA, async culture |
| Over-processing | 15 manual approval steps | Automate, trust tests |
| Defects | Bug found in production | Shift testing left |
| Transportation | Email → Jira → Slack → meeting | Single source of truth |
| Partially done | Feature branches open for weeks | Trunk-based dev + feature flags |
| Motion | Context switching between 6 projects | WIP limits, team focus |
Blameless Post-Mortem¶
When something breaks, learn from it without blame:
## Incident Report: Payment Service Outage — 2026-05-10
**Summary**: Payment service unavailable for 23 minutes, affecting ~4,200 users.
### Timeline
- 14:32 — Deployment of v2.1.0 began
- 14:38 — Deployment complete, health checks passing
- 14:41 — First alert: error rate >5%
- 14:43 — On-call engineer paged
- 14:47 — Root cause identified: new Redis connection pool exhausted
- 14:55 — Rollback initiated
- 15:01 — Service restored, error rate < 0.1%
### Root Cause
New feature added Redis caching but used default pool size (10 connections).
Under production load, pool exhausted causing connection timeouts.
### Why Didn't We Catch This?
- Load test used only 100 concurrent users (production peaks at 2,000)
- Redis connection pool size not monitored as a metric
- No staging environment that matches production scale
### Action Items
| Action | Owner | Due Date |
|--------|-------|---------|
| Add Redis connection pool utilization to dashboards | SRE Team | 2026-05-12 |
| Update load tests to use production-scale traffic | QA Team | 2026-05-18 |
| Add Redis pool exhaustion alert | SRE Team | 2026-05-12 |
| Create staging environment at 25% production scale | Platform Team | 2026-06-01 |
### What Went Well
- Alert fired within 3 minutes of degradation
- Rollback procedure executed without confusion
- On-call runbook was accurate and helpful
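The headline numbers in a report like this can be derived directly from the timeline, which helps keep the summary and the timeline consistent. The following is a minimal sketch using the timestamps from the example above. It assumes degradation began when the deployment completed (14:38), which is what the 23-minute figure in the summary implies. The dictionary keys and the helper name are hypothetical.

```python
from datetime import datetime

# Timestamps copied from the example timeline (same day, 24-hour clock).
TIMELINE = {
    "deploy_complete": "14:38",
    "first_alert": "14:41",
    "rollback_initiated": "14:55",
    "service_restored": "15:01",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["deploy_complete"], TIMELINE["first_alert"])
time_to_restore = minutes_between(TIMELINE["deploy_complete"], TIMELINE["service_restored"])
rollback_duration = minutes_between(TIMELINE["rollback_initiated"], TIMELINE["service_restored"])

print(f"Time to detect:    {time_to_detect} min")
print(f"Time to restore:   {time_to_restore} min")
print(f"Rollback duration: {rollback_duration} min")
```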
Putting It All Together: The Complete CI/CD Pipeline¶
Here is a GitHub Actions pipeline that ties the stages together, from code quality checks through production deployment:
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  REGISTRY: harbor.mycompany.com
  IMAGE_NAME: payment-service
  PYTHON_VERSION: "3.12"
jobs:
  # ─── Stage 1: Code Quality ──────────────────────────────────────
  code-quality:
    name: Code Quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Install dependencies
        run: pip install poetry && poetry install
      - name: Lint and format check
        run: |
          poetry run black --check .
          poetry run ruff check .
      - name: SAST — Semgrep
        uses: semgrep/semgrep-action@v1
        with:
          config: p/python p/owasp-top-ten
      - name: SonarQube scan
        uses: SonarSource/sonarqube-scan-action@v2
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
  # ─── Stage 2: Unit Tests ────────────────────────────────────────
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: code-quality
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Install and test
        run: |
          pip install poetry && poetry install
          poetry run pytest tests/unit \
            --cov=src \
            --cov-report=xml \
            --cov-fail-under=80 \
            -v
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
  # ─── Stage 3: Integration Tests ─────────────────────────────────
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: unit-tests
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: testpass
          POSTGRES_DB: testdb
        # Map the port so the tests can reach the service on localhost
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Run integration tests
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost/testdb
          REDIS_URL: redis://localhost:6379
        run: |
          pip install poetry && poetry install
          poetry run pytest tests/integration -v
  # ─── Stage 4: Build and Scan Image ──────────────────────────────
  build-image:
    name: Build & Scan Image
    runs-on: ubuntu-latest
    needs: integration-tests
    if: github.event_name == 'push'
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
      - name: Build image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          # Build locally first; the image is pushed only after the scan passes
          push: false
          load: true
          tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          image: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: cyclonedx-json
          output-file: sbom.json
      - name: Scan image — Trivy
        # In real use, pin this action to a release tag or commit SHA instead of master
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: table
          exit-code: 1
          severity: HIGH,CRITICAL
      - name: Push image
        if: success()
        run: |
          docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
      # A step cannot combine `uses` and `run`, so install Cosign first, then sign
      - name: Install Cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign image
        run: |
          cosign sign --yes --key env://COSIGN_PRIVATE_KEY \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
        env:
          COSIGN_PRIVATE_KEY: ${{ secrets.COSIGN_PRIVATE_KEY }}
          # Assumed secret name; needed when the private key is password-protected
          COSIGN_PASSWORD: ${{ secrets.COSIGN_PASSWORD }}
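  # ─── Stage 5: Deploy to Staging ─────────────────────────────────
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build-image
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Update image tag in GitOps repo
        run: |
          git clone https://x-token:${{ secrets.GITOPS_TOKEN }}@github.com/myorg/gitops-repo.git
          cd gitops-repo
          # A fresh runner has no commit identity, so set one before committing
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          yq e '.spec.template.spec.containers[0].image = "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"' \
            -i apps/payment-service/staging/deployment.yaml
          git commit -am "chore: deploy payment-service ${{ github.sha }} to staging"
          git push
      - name: Wait for ArgoCD sync
        # Assumes the argocd CLI is installed on the runner and that these
        # two secrets (assumed names) hold the server address and an API token
        env:
          ARGOCD_SERVER: ${{ secrets.ARGOCD_SERVER }}
          ARGOCD_AUTH_TOKEN: ${{ secrets.ARGOCD_AUTH_TOKEN }}
        run: |
          argocd app wait payment-service-staging \
            --timeout 300 \
            --health
      - name: Run smoke tests
        run: |
          pip install poetry && poetry install
          poetry run pytest tests/smoke \
            --base-url=https://payment-staging.mycompany.com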
  # ─── Stage 6: Deploy to Production ──────────────────────────────
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4
        with:
          # semantic-release needs the full history and tags
          fetch-depth: 0
      - name: Update image tag in GitOps repo (production)
        run: |
          git clone https://x-token:${{ secrets.GITOPS_TOKEN }}@github.com/myorg/gitops-repo.git
          cd gitops-repo
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          yq e '.spec.template.spec.containers[0].image = "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"' \
            -i apps/payment-service/production/deployment.yaml
          git commit -am "chore: deploy payment-service ${{ github.sha }} to production"
          git push
      - name: Create GitHub release
        # semantic-release is an npm package, not a GitHub Action, so run it with npx
        run: npx semantic-release@23
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Notify team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "✅ payment-service deployed to production",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*payment-service* deployed to production\nCommit: ${{ github.sha }}\nTriggered by: ${{ github.actor }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          # Required by v1 of the action when posting Block Kit payloads to an incoming webhook
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
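The pipeline runs `tests/smoke` after the staging deployment, but the tests themselves are not shown. As a rough idea of what one might contain, here is a minimal health check using only the standard library. The `/health` path, the function name, and the injectable `opener` parameter are assumptions for illustration; the real service may expose a different endpoint.

```python
import urllib.request


def is_healthy(base_url, path="/health", timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET base_url + path answers with HTTP 200.

    `opener` is injectable so the check can be exercised without a network.
    """
    url = base_url.rstrip("/") + path
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError (non-2xx), timeouts, and refused connections
        return False


class FakeResponse:
    """Stand-in for an HTTP response, used only in the offline example below."""

    def __init__(self, status):
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def fake_ok(url, timeout):
    return FakeResponse(200)


def fake_down(url, timeout):
    raise OSError("connection refused")


print(is_healthy("https://payment-staging.mycompany.com", opener=fake_ok))
print(is_healthy("https://payment-staging.mycompany.com", opener=fake_down))
```

In a pytest suite the same function would be wrapped in a test that asserts on its result, with the base URL supplied by the `--base-url` option used in the workflow.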
DevOps Framework at a Glance¶
DevOps Delivery Framework: stages, gates, and key tools

| Stage | Gate (Must Pass) | Key Tools |
|---|---|---|
| Plan | Requirements clear | Jira, Confluence, Miro |
| Code | PR approved, SAST clean | GitHub/GitLab, SonarQube, Semgrep, Snyk, pre-commit |
| Build | Unit tests pass, coverage > 80% | Maven/Poetry/npm, JFrog, Nexus, JUnit, pytest |
| Package | Image scan clean, image signed | Docker, Trivy, Snyk, Harbor, Cosign, Syft (SBOM), ECR |
| Test | Integration pass, no HIGH CVEs | pytest, Selenium, Playwright, k6, OWASP ZAP, Pact, WireMock |
| Release | Version bumped, changelog updated | semantic-release, GitHub Releases, Conventional Commits |
| Deploy | Smoke tests pass, health checks OK | Argo CD, Argo Rollouts, Helm, Kubernetes, Istio |
| Operate | SLOs met | Prometheus, Grafana, Alertmanager, PagerDuty, Runbooks |
| Monitor | DORA targets met | Datadog, Grafana, Loki, Tempo, Dynatrace, Elastic Stack |

Feedback loop: findings from the later stages flow back into the earlier ones.
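A gate is easiest to reason about when it is written down as an explicit check rather than a convention. The sketch below models three of the gates from the overview as predicates over a results dictionary and reports the first one that blocks progress. It is an illustration of the idea, not how any particular CI system implements gates, and every key name in `results` is hypothetical.

```python
from typing import Callable, NamedTuple, Optional


class Gate(NamedTuple):
    stage: str
    rule: str
    check: Callable[[dict], bool]


# Gate wording copied from the overview; the dictionary keys are made up.
GATES = [
    Gate("Build", "Unit tests pass, coverage > 80%",
         lambda r: r["unit_tests_passed"] and r["coverage_pct"] > 80),
    Gate("Package", "Image scan clean, image signed",
         lambda r: r["high_or_critical_cves"] == 0 and r["image_signed"]),
    Gate("Deploy", "Smoke tests pass, health checks OK",
         lambda r: r["smoke_tests_passed"] and r["health_checks_ok"]),
]


def first_failed_gate(results: dict) -> Optional[Gate]:
    """Return the first gate that does not pass, or None if all pass."""
    for gate in GATES:
        if not gate.check(results):
            return gate
    return None


results = {
    "unit_tests_passed": True,
    "coverage_pct": 78.5,
    "high_or_critical_cves": 0,
    "image_signed": True,
    "smoke_tests_passed": True,
    "health_checks_ok": True,
}

blocked = first_failed_gate(results)
print(f"Blocked at {blocked.stage}: {blocked.rule}" if blocked else "All gates passed")
```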
Summary¶
If you are starting from scratch, don't try to implement everything at once. Use this prioritized roadmap:
Month 1 — Foundation
- Set up version control with PR reviews required
- Add pre-commit hooks (linting, secret detection)
- Write basic unit tests (target 50% coverage)
- Containerize your application
- Set up a basic CI pipeline (lint → test → build)
Month 2 — Quality and Security
- Add integration tests
- Add SAST scanning to CI
- Add container image scanning
- Set up artifact registry
- Implement Conventional Commits + semantic versioning
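To make the last item concrete: Conventional Commits let a tool derive the next semantic version from commit messages. The sketch below is a simplified take on that mapping (breaking change to major, `feat` to minor, `fix` to patch). Tools such as semantic-release apply richer, configurable rules, so treat this as an illustration of the idea. The function names are made up for the example.

```python
import re

# Matches headers such as "feat(cache): add Redis caching" or "fix!: drop field"
_HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<breaking>!)?: .+")


def bump_for(commits):
    """Return 'major', 'minor', 'patch', or None for a list of commit messages."""
    level = None
    for message in commits:
        header = message.splitlines()[0] if message else ""
        match = _HEADER.match(header)
        if not match:
            continue  # not a Conventional Commit, ignore it
        if match["breaking"] or "BREAKING CHANGE:" in message:
            return "major"
        if match["type"] == "feat":
            level = "minor"
        elif match["type"] == "fix" and level is None:
            level = "patch"
    return level


def next_version(current, commits):
    """Apply the bump implied by `commits` to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    bump = bump_for(commits)
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current  # nothing release-worthy


print(next_version("2.1.0", ["fix: handle timeout", "feat(cache): add Redis caching"]))
```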
Month 3 — Automation and Deployment
- Automate deployment to staging
- Add smoke tests post-deployment
- Implement blue-green or canary deployment
- Set up GitOps with Argo CD
Month 4+ — Observability and Improvement
- Implement structured logging
- Add Prometheus metrics (RED method)
- Set up distributed tracing (OpenTelemetry)
- Define and monitor SLOs
- Track and review DORA metrics monthly
- Run your first chaos engineering experiment
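The RED method tracks three signals per service: Rate (requests per second), Errors (share of failed requests), and Duration (latency, usually as a percentile). In practice Prometheus derives these from counters and histograms. The plain-Python sketch below only shows what the three numbers mean, using made-up request samples and a nearest-rank 95th percentile. The class and function names are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    duration_ms: float
    status: int


def red_summary(samples, window_seconds):
    """Compute Rate, Errors, and Duration (p95) for requests seen in a window."""
    if not samples:
        return {"rate_per_s": 0.0, "error_pct": 0.0, "p95_ms": 0.0}
    count = len(samples)
    errors = sum(1 for s in samples if s.status >= 500)
    durations = sorted(s.duration_ms for s in samples)
    # Nearest-rank percentile: ceil(0.95 * count), done in integer arithmetic
    rank = (95 * count + 99) // 100
    return {
        "rate_per_s": count / window_seconds,
        "error_pct": 100 * errors / count,
        "p95_ms": durations[rank - 1],
    }


# 19 successful requests (10 ms to 190 ms) and one slow failure, over 10 seconds
samples = [RequestSample(duration_ms=10.0 * i, status=200) for i in range(1, 20)]
samples.append(RequestSample(duration_ms=200.0, status=503))

summary = red_summary(samples, window_seconds=10)
print(summary)
```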
A DevOps framework is never finished — it evolves as your team grows, your product matures, and the technology landscape changes. The goal is not to have a perfect pipeline on day one, but to continuously close the feedback loop between production reality and development decisions. Start small, measure everything, and iterate.