Engineering Standards for DevOps: The Complete Guide

A team without standards is a team that reinvents everything — every time, in every project. Standards are not bureaucracy. They are the codified answers to questions your team has already solved, so you can spend your energy solving new ones.

This guide covers thirteen engineering standards that separate high-performing DevOps teams from the rest. For each standard, we go beyond the "what" to explain the "why" and the "how" — with concrete tooling, configuration examples, and the decision frameworks you need to implement them in your organisation.

Why Engineering Standards Matter

The DORA (DevOps Research and Assessment) research programme has tracked software delivery performance across thousands of organisations for over a decade. Its consistent finding: the capabilities behind the practices in this guide predict software delivery performance, which in turn predicts organisational performance.

  ELITE vs LOW PERFORMERS (DORA 2021 State of DevOps Report)

  ┌─────────────────────────────┬───────────────┬──────────────┐
  │ Metric                      │ Elite         │ Low          │
  ├─────────────────────────────┼───────────────┼──────────────┤
  │ Deployment Frequency        │ Multiple/day  │ < 1/6 months │
  │ Lead Time for Changes       │ < 1 hour      │ > 6 months   │
  │ Change Failure Rate         │ 0–15%         │ 16–30%       │
  │ Time to Restore Service     │ < 1 hour      │ > 6 months   │
  └─────────────────────────────┴───────────────┴──────────────┘

  Elite performers are:
  → 208× more frequent deployers
  → 106× faster from commit to deploy (lead time)
  → 2,604× faster at recovering from incidents
  → 7× lower change failure rate
  → 1.8× more likely to meet or exceed commercial targets

The gap is not talent. It's practices. The thirteen standards below are the practices that make the difference.
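The four metrics in the table are straightforward to compute once deployments are recorded. The sketch below is one possible way to do it, assuming a hypothetical `Deployment` record with commit, deploy, and restore timestamps; the field names and the choice of medians are illustrative, not DORA's official methodology.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # caused an incident or rollback
    restored_at: Optional[datetime] = None   # service restored, if it failed


def hours(delta: timedelta) -> float:
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute the four key metrics over a reporting window."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate_pct": 100 * len(failures) / len(deployments),
        "median_time_to_restore_hours": median(restore_times) if restore_times else 0.0,
    }
```

Feeding this from your deployment tool's event log gives a baseline to compare against the table before adopting any of the standards below.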

The CALMS Framework — Culture First

Before any tool or process, DevOps requires a cultural shift. The CALMS framework describes the five pillars:

  C — Culture       Shared responsibility. Developers own operations.
                    "You build it, you run it." — Werner Vogels, Amazon CTO

  A — Automation    Automate everything that is repeatable.
                    If you do it twice, automate it.

  L — Lean          Eliminate waste. Small batches. Fast feedback.
                    Stop starting, start finishing.

  M — Measurement   You can't improve what you don't measure.
                    Instrument everything. Decide with data.

  S — Sharing       Share knowledge, tools, and post-mortems.
                    Blameless culture. Shared ownership.

Every standard in this guide serves one or more of these pillars. When a practice feels bureaucratic, check which CALMS pillar it serves — if it serves none, it shouldn't be a standard.

Standard 1: The Twelve-Factor Application

"Build software-as-a-service apps that use declarative formats for setup automation, have a clean contract with the underlying operating system, are suitable for deployment on modern cloud platforms, minimise divergence between development and production, and can scale up without significant changes to tooling, architecture, or development practices."

The 12-Factor App methodology is the baseline contract between your application code and the platform it runs on. Every service in your organisation should comply.

The three factors most frequently violated — and most impactful to fix:

Factor III: Config in the Environment

# BAD: config hardcoded or in committed files
DATABASE_URL = "postgresql://admin:secret@prod-db:5432/orders"
STRIPE_KEY   = "sk_live_abc123"

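# GOOD: all config from environment variables
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Also read a local .env file (never committed); real env vars take precedence
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str           # read from DATABASE_URL
    stripe_secret_key: str      # read from STRIPE_SECRET_KEY
    log_level: str = "INFO"
    debug: bool = False

settings = Settings()  # raises a validation error at startup if a required variable is missing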

Factor XI: Logs as Event Streams

# BAD: writing to log files
import logging
logging.basicConfig(filename='/var/log/app.log', level=logging.INFO)

# GOOD: write to stdout, structured as JSON
import structlog, sys

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(file=sys.stdout),
)
log = structlog.get_logger()
log.info("order_placed", order_id="abc123", amount="99.99", currency="USD")

Output (queryable, parseable, monitorable):

{"timestamp": "2026-05-18T10:23:41Z", "level": "info", "event": "order_placed",
 "order_id": "abc123", "amount": "99.99", "currency": "USD"}

Factor VI: Stateless Processes

No state in memory between requests. Sessions in Redis. Files in object storage (S3/GCS). This is what enables horizontal scaling — any instance can serve any request.

Quick compliance check

Can you kill any running instance of your service right now without losing data or user sessions? If yes, you're stateless. If no, find the state and externalise it. (Finishing in-flight requests before exit is a related but separate property: graceful shutdown, Factor IX.)
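To make the compliance check concrete, here is a minimal sketch of externalised session state. `SessionStore`, `InMemoryStore`, and `CartInstance` are hypothetical names; the in-memory class only stands in for an external store such as Redis so the example runs without dependencies.

```python
from typing import Protocol


class SessionStore(Protocol):
    """What the service needs from an external store (Redis, in production)."""
    def get(self, session_id: str) -> dict: ...
    def put(self, session_id: str, data: dict) -> None: ...


class InMemoryStore:
    """Stand-in for the external store; an assumption for demo and tests only."""
    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        return dict(self._data.get(session_id, {}))

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)


class CartInstance:
    """One process of the service. It holds no per-user state of its own."""
    def __init__(self, store: SessionStore) -> None:
        self._store = store

    def add_item(self, session_id: str, sku: str) -> list[str]:
        session = self._store.get(session_id)
        items = session.get("items", []) + [sku]
        self._store.put(session_id, {**session, "items": items})
        return items


# Two instances share one store; losing instance_a loses nothing.
store = InMemoryStore()
instance_a, instance_b = CartInstance(store), CartInstance(store)
instance_a.add_item("session-1", "book-1")
del instance_a                                    # simulate killing the instance
print(instance_b.add_item("session-1", "book-2"))
```

Because every read and write goes through the store, any instance can pick up any session, which is the property that makes horizontal scaling safe.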

Standard 2: Source Code Management — Trunk-Based Development

The Three Branching Strategies Compared

  GITFLOW (traditional):                TRUNK-BASED DEVELOPMENT:

  main ──────────────────────────      main ─────────────────────
  │                                         │    │    │
  └── develop ─────────────────            feat feat feat
      │           │                        (< 2 days)
      └── feature └── release             merge → main daily
          (weeks)     (weeks)

  GitHub Flow (middle ground):

  main ─────────────────────────
       │    │    │    │
      PR   PR   PR   PR        (short-lived, reviewed, merged)

The DORA research conclusion: Trunk-Based Development — working directly on main or in very short-lived branches (< 2 days) — is one of the strongest predictors of high software delivery performance.

Trunk-Based Development Implementation

  RULES FOR TRUNK-BASED DEVELOPMENT

  ✓  All developers commit to main at least once per day
  ✓  Feature branches live < 2 days before merging
  ✓  Feature toggles hide incomplete features from users
  ✓  The main branch is ALWAYS deployable
  ✓  Build must pass before merge — no broken main, ever
  ✓  Small commits — each commit is a coherent, reviewable unit

Feature Toggles

Feature toggles let you merge incomplete features to main without exposing them to users:

# Feature toggle implementation
import os

class FeatureFlags:
    @staticmethod
    def new_checkout_flow() -> bool:
        return os.getenv("FF_NEW_CHECKOUT_FLOW", "false").lower() == "true"

    @staticmethod
    def ai_product_recommendations() -> bool:
        return os.getenv("FF_AI_RECOMMENDATIONS", "false").lower() == "true"

# In application code
from feature_flags import FeatureFlags

def get_checkout_view(request):
    if FeatureFlags.new_checkout_flow():
        return new_checkout_view(request)   # behind flag — merge to main safely
    return legacy_checkout_view(request)   # default path

# Feature flag config per environment
# development/.env
FF_NEW_CHECKOUT_FLOW=true
FF_AI_RECOMMENDATIONS=true

# staging/.env
FF_NEW_CHECKOUT_FLOW=true
FF_AI_RECOMMENDATIONS=false

# production/.env
FF_NEW_CHECKOUT_FLOW=false   # flip to true when ready to release
FF_AI_RECOMMENDATIONS=false

Standard 3: Git Pre-Commit Hooks

Pre-commit hooks run automated checks before a commit is accepted. They catch issues at the cheapest possible point — before the CI pipeline, before code review, before the merge.

Setting Up Pre-Commit

# Install pre-commit
pip install pre-commit

# Create .pre-commit-config.yaml at repo root
cat > .pre-commit-config.yaml << 'EOF'
repos:
  # Code formatting
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
        language_version: python3.12

  # Import sorting
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
        args: ["--profile", "black"]

  # Linting
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff
        args: ["--fix"]

  # Type checking
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy

  # Secret scanning
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets

  # General file hygiene
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: check-added-large-files
        args: ['--maxkb=500']
      - id: no-commit-to-branch
        args: ['--branch', 'main', '--branch', 'master']
EOF

# Install the hooks (run once per clone)
pre-commit install
# Only takes effect once a commit-msg hook (e.g. a commit message linter) is added to the config
pre-commit install --hook-type commit-msg

Commit Message Standard — Conventional Commits

Enforce a consistent, machine-readable commit format that enables automatic changelog generation and semantic versioning:

  FORMAT: <type>(<scope>): <description>

  TYPES:
  feat     → New feature for the user (triggers MINOR version bump)
  fix      → Bug fix for the user (triggers PATCH version bump)
  docs     → Documentation only
  style    → Formatting, missing semicolons (no logic change)
  refactor → Code restructure (no feature or bug)
  test     → Adding or fixing tests
  chore    → Build process, dependency updates
  perf     → Performance improvement
  ci       → CI/CD configuration
  <type>!  → Incompatible API change (triggers MAJOR version bump);
             can also be signalled by a "BREAKING CHANGE: <description>" footer.
             BREAKING CHANGE is not a commit type.

  EXAMPLES:
  feat(orders): add coupon code validation at checkout
  fix(payments): handle Stripe timeout without losing order state
  refactor(catalog): extract search logic into SearchService
  chore(deps): bump sqlalchemy from 2.0.29 to 2.0.30
  feat(api)!: rename /orders endpoint to /v2/orders

# commitlint .commitlintrc.yaml — enforce conventional commits in CI
extends:
  - '@commitlint/config-conventional'
rules:
  body-max-line-length: [2, always, 100]
  subject-min-length: [2, always, 20]   # minimum meaningful message
  references-empty: [1, never]           # warn if no issue reference (Jira keys such as PROJ-123 need issuePrefixes in the parser options)
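The link between commit messages and version bumps can be illustrated with a small parser. This is a simplified sketch of the Conventional Commits header grammar, not the implementation used by commitlint or semantic-release; the `bump_for` function and the rule that `perf` triggers a PATCH bump are assumptions for illustration.

```python
import re

# <type>(<scope>)!: <description>  (scope and "!" are optional)
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()\s]+)\))?(?P<bang>!)?: (?P<desc>\S.*)$"
)
# A footer line that marks a breaking change
BREAKING_FOOTER = re.compile(r"^BREAKING[ -]CHANGE: ", re.MULTILINE)


def bump_for(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for one commit message."""
    header = message.splitlines()[0] if message else ""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"not a conventional commit: {header!r}")
    if match["bang"] or BREAKING_FOOTER.search(message):
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] in {"fix", "perf"}:   # assumption: perf counts as a patch
        return "patch"
    return "none"


print(bump_for("feat(orders): add coupon code validation at checkout"))
print(bump_for("feat(api)!: rename /orders endpoint to /v2/orders"))
```

A release tool applies the highest bump found among the commits since the last tag, which is why a single malformed header matters enough to lint for.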

Standard 4: CI/CD Pipeline

Every commit to main should trigger a pipeline that verifies, tests, and deploys the change. The pipeline is your quality gate — it must pass before code reaches production.

Pipeline Architecture

  COMMIT → CI (verify) → CD-STAGING → CD-PRODUCTION

  ┌──────────────────────────────────────────────────────────────┐
  │                    CI PIPELINE (every commit)                 │
  │                                                              │
  │  Checkout → Install deps → Lint/Format → Unit Tests →        │
  │  SAST Scan → Build Image → Push to Registry → [tag: sha]     │
  └──────────────────────────────────────────────────────────────┘
                              │
                    (on main branch only)
                              ▼
  ┌──────────────────────────────────────────────────────────────┐
  │              CD-STAGING (automatic on main merge)             │
  │                                                              │
  │  Deploy to Staging → Integration Tests → DAST Scan →         │
  │  Smoke Tests → Performance Tests → [approval gate]           │
  └──────────────────────────────────────────────────────────────┘
                              │
                    (manual approval or auto)
                              ▼
  ┌──────────────────────────────────────────────────────────────┐
  │              CD-PRODUCTION (controlled)                       │
  │                                                              │
  │  Deploy (Canary 5%) → Monitor → Expand (25%) →              │
  │  Monitor → Full Rollout → Post-deploy Smoke Tests            │
  └──────────────────────────────────────────────────────────────┘

Complete GitHub Actions Pipeline

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
  PYTHON_VERSION: "3.12"

jobs:
  # ── STAGE 1: Code Quality ─────────────────────────────────────
  quality:
    name: Code Quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements-dev.txt

      - name: Lint (ruff)
        run: ruff check . --output-format=github

      - name: Format check (black)
        run: black --check .

      - name: Type check (mypy)
        run: mypy src/

      - name: Detect secrets
        uses: reviewdog/action-detect-secrets@v0.21.0

  # ── STAGE 2: Tests ────────────────────────────────────────────
  test:
    name: Tests
    runs-on: ubuntu-latest
    needs: quality
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_DB: test_db
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
        ports:
          - 5432:5432
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements-dev.txt

      - name: Run unit tests with coverage
        run: |
          pytest tests/unit/ \
            --cov=src \
            --cov-report=xml \
            --cov-report=term-missing \
            --cov-fail-under=80 \
            -v
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379/0

      - name: Run integration tests
        run: pytest tests/integration/ -v
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test_db

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v4
        with:
          files: coverage.xml
          token: ${{ secrets.CODECOV_TOKEN }}   # v4 uploads need a repository token
          fail_ci_if_error: true

  # ── STAGE 3: Security Scan ────────────────────────────────────
  security:
    name: Security Scan
    runs-on: ubuntu-latest
    needs: quality
    steps:
      - uses: actions/checkout@v4

      - name: SAST — Semgrep
        uses: semgrep/semgrep-action@v1
        with:
          config: >-
            p/python
            p/owasp-top-ten
            p/secrets
        env:
          SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_APP_TOKEN }}

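      - name: Dependency scan (pip-audit)
        # pip-audit exits non-zero on any known vulnerability; it has no severity filter,
        # so a single invocation both writes the report and gates the job.
        run: |
          pip install pip-audit
          pip-audit --requirement requirements.txt \
                    --format json \
                    --output pip-audit-report.json

      - name: Filesystem scan (Trivy)
        # Scans the checked-out source tree; the image is not built yet in this job.
        # Pin to a release tag or commit SHA rather than master in a real pipeline.
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          scanners: vuln,secret
          severity: HIGH,CRITICAL
          exit-code: 1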

  # ── STAGE 4: Build & Push Image ──────────────────────────────
  build:
    name: Build Container Image
    runs-on: ubuntu-latest
    needs: [test, security]
    permissions:
      contents: read
      packages: write
    outputs:
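      # One image reference for kubectl; steps.meta.outputs.tags is a multi-line list of all tags
      image_tag: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}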
      image_digest: ${{ steps.push.outputs.digest }}

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract Docker metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,format=long
            type=ref,event=branch
            type=semver,pattern={{version}}

      - name: Build and push
        id: push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ── STAGE 5: Deploy to Staging ────────────────────────────────
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: staging

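    steps:
      - uses: actions/checkout@v4   # needed for ./scripts/smoke-test.sh

      # Assumes kubectl is already authenticated against the staging cluster
      - name: Deploy to Kubernetes (staging)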
        run: |
          kubectl set image deployment/order-service \
            order-service=${{ needs.build.outputs.image_tag }} \
            --namespace=staging
          kubectl rollout status deployment/order-service \
            --namespace=staging --timeout=5m

      - name: Run smoke tests against staging
        run: |
          ./scripts/smoke-test.sh ${{ vars.STAGING_URL }}

  # ── STAGE 6: Deploy to Production ────────────────────────────
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
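    needs: [build, deploy-staging]   # build must be a direct dependency to read needs.build.outputs
    environment: production   # manual approval, if required reviewers are configured

    steps:
      - uses: actions/checkout@v4   # needed for the k8s/ manifests and scripts/

      - name: Deploy canary (5%)
        run: kubectl apply -f k8s/canary-5pct.yaml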

      - name: Monitor canary (5 minutes)
        run: ./scripts/monitor-canary.sh --duration=300 --error-threshold=1

      - name: Promote to full rollout
        run: |
          kubectl set image deployment/order-service \
            order-service=${{ needs.build.outputs.image_tag }} \
            --namespace=production
          kubectl rollout status deployment/order-service \
            --namespace=production --timeout=10m

      - name: Post-deploy smoke tests
        run: ./scripts/smoke-test.sh ${{ vars.PRODUCTION_URL }}
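The pipeline calls `./scripts/monitor-canary.sh` without showing what it decides. As a rough sketch of the decision such a script might make, the function below assumes `--error-threshold=1` means an absolute error rate of 1%, and that a minimum amount of traffic is required before judging; `Window` and `canary_verdict` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed for the canary during one monitoring window."""
    requests: int
    errors: int

    @property
    def error_pct(self) -> float:
        return 100 * self.errors / self.requests if self.requests else 0.0


def canary_verdict(canary: Window,
                   error_threshold_pct: float = 1.0,
                   min_requests: int = 100) -> str:
    """Return 'promote', 'rollback', or 'wait' for the current window."""
    if canary.requests < min_requests:
        return "wait"          # too little traffic to judge either way
    if canary.error_pct > error_threshold_pct:
        return "rollback"      # threshold breached: stop the rollout
    return "promote"           # healthy: expand to the next traffic step


print(canary_verdict(Window(requests=1000, errors=20)))
```

A production check would usually also compare the canary against the stable version's error rate and latency, so that a pre-existing problem does not block every release.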

Standard 5: The Testing Pyramid

Testing must be structured in a pyramid — many fast, cheap tests at the base; few slow, expensive tests at the top:

         ╱▔▔▔▔▔▔╲
        ╱  E2E    ╲         Few, slow, brittle
       ╱  Tests    ╲        5-10% of test suite
      ╱─────────────╲
     ╱ Integration   ╲      Some, moderate speed
    ╱    Tests        ╲     20-30% of test suite
   ╱───────────────────╲
  ╱    Unit Tests       ╲   Many, fast, isolated
 ╱                       ╲  60-70% of test suite
╱─────────────────────────╲

Unit Testing Standards

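# Naming convention: test_<unit>_<condition>_<expected_result>
# Every test: Arrange → Act → Assert
# (Cart, Money, Coupon, etc. are the application's own domain classes)
from decimal import Decimal
from uuid import uuid4

import pytest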

def test_cart_apply_coupon_with_20pct_discount_reduces_total_correctly():
    # Arrange
    cart = Cart.create(customer_id=CustomerId(uuid4()))
    cart.add_book(
        book_id=BookId(uuid4()),
        title="Clean Code",
        price=Money(Decimal("100.00"), "USD"),
        quantity=1
    )

    # Act
    cart.apply_coupon(Coupon(code="SAVE20", discount_percent=20))

    # Assert
    assert cart.total == Money(Decimal("80.00"), "USD")


def test_cart_checkout_with_empty_cart_raises_empty_cart_error():
    # Arrange
    cart = Cart.create(customer_id=CustomerId(uuid4()))

    # Act / Assert
    with pytest.raises(EmptyCartError):
        cart.checkout()


# Test isolation: no database, no network, no filesystem
# All external dependencies are replaced with fakes/stubs
class TestOrderService:
    def setup_method(self):
        self.order_repo = InMemoryOrderRepository()   # no real DB
        self.email = FakeEmailService()               # no real email
        self.service = OrderService(self.order_repo, self.email)

Coverage Standards

# pyproject.toml — enforce coverage thresholds
[tool.pytest.ini_options]
addopts = "--cov=src --cov-fail-under=80"

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "if __name__ == .__main__.:",
    "raise NotImplementedError",
    "class .*\\bProtocol\\b",
    "@(abc\\.)?abstractmethod",
]

80% coverage is a floor, not a goal

80% coverage is the minimum gate — it prevents obvious gaps. But 80% coverage with bad tests (tests that never assert anything meaningful) is worse than 60% coverage with good tests. Aim for meaningful coverage: test the business logic paths, not just line execution.
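The difference between executed lines and verified behaviour is easy to show. In this sketch, `apply_discount` is a hypothetical function; both tests produce identical line coverage of its happy path, but only the second would catch a wrong result.

```python
from decimal import Decimal


def apply_discount(total: Decimal, percent: int) -> Decimal:
    """Return total reduced by percent; reject percentages outside 0..100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total * (Decimal(100 - percent) / Decimal(100))


def test_vacuous():
    # Runs the code, so the lines count as covered, but nothing is asserted:
    # a broken calculation would still pass.
    apply_discount(Decimal("100.00"), 20)


def test_meaningful():
    # Pins down the result and the guard clause.
    assert apply_discount(Decimal("100.00"), 20) == Decimal("80.00")
    try:
        apply_discount(Decimal("100.00"), 120)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for percent > 100")
```

Mutation testing tools automate this check by altering the code and confirming that at least one test fails.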

Contract Testing — Verifying Service Integration

Contract tests verify that a consumer's expectations of a provider's API are met, without a full integration environment:

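# Using Pact for consumer-driven contract testing (pact-python v2 API)
from decimal import Decimal

import pytest
from pact import Consumer, Provider

# Consumer side: Order Service expects this from Catalog Service
@pytest.fixture(scope="session")
def pact():
    pact = Consumer("OrderService").has_pact_with(Provider("CatalogService"))
    pact.start_service()   # local mock provider (default: http://localhost:1234)
    yield pact
    pact.stop_service()

def test_get_book_price_from_catalog(pact):
    # Define what Order Service expects
    (pact
     .given("Book abc123 exists in catalog")
     .upon_receiving("a request for book price")
     .with_request("GET", "/books/abc123/price")
     .will_respond_with(200, body={
         "book_id": "abc123",
         "price": "29.99",
         "currency": "USD",
         "available": True,
     }))

    with pact:
        # catalog_client is the application's own client, pointed at the mock provider URL
        result = catalog_client.get_book_price("abc123")

    assert result.price == Decimal("29.99")
    assert result.currency == "USD"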

# This contract is published to a Pact Broker.
# The Catalog Service runs "provider verification" to prove it satisfies the contract.
# If Catalog changes the API in a breaking way, verification fails — before production.

Standard 6: DevSecOps — Security Shifted Left

Security is not a phase at the end of development — it's a practice embedded at every stage. This is Shift Left Security: finding vulnerabilities earlier, when they are cheapest to fix.

  SHIFT LEFT: vulnerability fix cost by stage

  Design  →  Code  →  Build  →  Test  →  Staging  →  Production
  $1          $6       $65       $960     $3,000       $100,000+

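  Illustrative figures, commonly attributed to the IBM Systems Science Institute.
  The point is the order-of-magnitude growth, not the exact dollar amounts.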

SAST — Static Application Security Testing

SAST analyses source code for vulnerabilities without running the application:

# Semgrep: SAST in CI (.github/workflows/security.yml)
# Rulesets, in order: Python-specific rules, OWASP Top 10 checks, hardcoded secrets,
# SQL injection patterns, cross-site scripting.
# (Comments cannot go inside the folded block below; they would become part of the value.)
- name: SAST with Semgrep
  uses: semgrep/semgrep-action@v1
  with:
    config: >-
      p/python
      p/owasp-top-ten
      p/secrets
      p/sql-injection
      p/xss

# SonarQube — code quality + security in one
sonar.projectKey=order-service
sonar.sources=src
sonar.tests=tests
sonar.python.coverage.reportPaths=coverage.xml
sonar.qualitygate.wait=true

# Quality Gate blocks deployment if:
# - Coverage < 80%
# - Duplicated code > 3%
# - Security Rating < A
# - Reliability Rating < A
# - New bugs introduced > 0

DAST — Dynamic Application Security Testing

DAST tests the running application by sending malicious inputs — like a real attacker would:

# OWASP ZAP in CI pipeline (against staging environment)
- name: DAST with OWASP ZAP
  uses: zaproxy/action-full-scan@v0.10.0
  with:
    target: ${{ vars.STAGING_URL }}
    rules_file_name: .zap/rules.tsv
    cmd_options: '-a -j'      # -a: include alpha scan rules, -j: also run the AJAX spider
    fail_action: true          # fail the job when ZAP reports alerts (tune severity via the rules file)

  .zap/rules.tsv — tune OWASP ZAP rules
  10020  IGNORE   # X-Frame-Options (handled by CDN)
  10038  WARN     # Content Security Policy (warn, not fail)
  40012  FAIL     # Reflected XSS (always fail)
  40018  FAIL     # SQL Injection (always fail)
  90020  FAIL     # Remote OS Command Injection (always fail)

Dependency Scanning

# GitHub Dependabot — automatic dependency updates
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: pip
    directory: /
    schedule:
      interval: weekly
    open-pull-requests-limit: 10
    labels:
      - dependencies
      - security
    ignore:
      - dependency-name: "*"
        update-types: ["version-update:semver-major"]   # manual for major bumps

  - package-ecosystem: docker
    directory: /
    schedule:
      interval: weekly

  - package-ecosystem: github-actions
    directory: /
    schedule:
      interval: weekly

# OWASP Dependency-Check: scan against the National Vulnerability Database
# --failOnCVSS 7 fails the run on HIGH (CVSS 7+) vulnerabilities.
# (A comment after a trailing backslash would break the line continuation.)
docker run --rm \
  -v "$(pwd)":/src \
  -v "$(pwd)/odc-reports":/report \
  owasp/dependency-check:latest \
  --scan /src \
  --format HTML \
  --format JSON \
  --out /report \
  --failOnCVSS 7 \
  --enableRetired

Secret Scanning

Treat any secret committed to git as compromised: deleting the file leaves the value in history, so rotate the credential immediately rather than trying to scrub the commit.

# detect-secrets — prevent secrets before commit
# .pre-commit-config.yaml
- repo: https://github.com/Yelp/detect-secrets
  rev: v1.5.0
  hooks:
    - id: detect-secrets
      args: ['--baseline', '.secrets.baseline']

# Generate baseline (whitelist known false positives)
detect-secrets scan > .secrets.baseline
# Commit .secrets.baseline — it tracks known non-secrets
# CI fails if new secrets are found that aren't in baseline

# GitHub Advanced Security — secret scanning on push
# (enabled in repo settings)
# Alerts on: AWS keys, Stripe keys, GitHub tokens, 200+ pattern types
# Can block push if secret detected (push protection)

Container Security

# Dockerfile security best practices

# Use minimal, specific base images (not python:latest, not python:3.12).
# Dockerfiles do not support trailing comments, so comments sit on their own lines.
FROM python:3.12-slim-bookworm AS base

# Run as non-root user
RUN groupadd --gid 1001 appgroup && \
    useradd --uid 1001 --gid appgroup --shell /bin/bash --create-home appuser

# Multi-stage build: production image has no build tools
FROM base AS builder
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

FROM base AS production
COPY --from=builder --chown=appuser:appgroup /root/.local /home/appuser/.local
COPY --chown=appuser:appgroup src/ /app/

# Make console scripts installed with --user (e.g. uvicorn) resolvable
ENV PATH=/home/appuser/.local/bin:$PATH

# Never run as root
USER appuser
WORKDIR /app

# A read-only root filesystem is set at runtime (e.g. docker run --read-only)
# Health check defined in Dockerfile; uses Python because slim images ship without curl
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health/live', timeout=2)" || exit 1

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# Trivy — container vulnerability scanning
- name: Scan container image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
    format: sarif
    output: trivy-results.sarif
    severity: HIGH,CRITICAL
    exit-code: 1    # fail pipeline on HIGH/CRITICAL CVEs

Standard 7: Semantic Versioning

Every release must have a version number that communicates the nature of the change to consumers.

  SEMANTIC VERSIONING FORMAT: MAJOR.MINOR.PATCH

  MAJOR: Breaking change — consumers must update their code
  MINOR: New feature, backward compatible — consumers can upgrade safely
  PATCH: Bug fix, backward compatible — consumers should upgrade

  Examples:
  1.0.0 → 1.0.1  Bug fix (PATCH)
  1.0.1 → 1.1.0  New feature added (MINOR)
  1.1.0 → 2.0.0  API contract changed (MAJOR)

  Pre-release:
  2.0.0-alpha.1   → Internal testing
  2.0.0-beta.2    → External beta testing
  2.0.0-rc.1      → Release candidate
  2.0.0            → Stable release
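The bump and pre-release ordering rules above can be expressed in a few lines. This is a minimal sketch that handles `MAJOR.MINOR.PATCH` with an optional pre-release suffix and ignores build metadata; the `Version` class and its method names are illustrative, not a library API.

```python
import re
from dataclasses import dataclass

SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


@dataclass(frozen=True)
class Version:
    major: int
    minor: int
    patch: int
    prerelease: str = ""

    @classmethod
    def parse(cls, text: str) -> "Version":
        m = SEMVER.match(text)
        if m is None:
            raise ValueError(f"not a semantic version: {text!r}")
        return cls(int(m[1]), int(m[2]), int(m[3]), m[4] or "")

    def bump(self, part: str) -> "Version":
        """Increment one part and reset everything to its right."""
        if part == "major":
            return Version(self.major + 1, 0, 0)
        if part == "minor":
            return Version(self.major, self.minor + 1, 0)
        if part == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown part: {part!r}")

    def precedence(self) -> tuple:
        """Sort key: a pre-release ranks below its release; numeric ids below text ids."""
        if not self.prerelease:
            return (self.major, self.minor, self.patch, 1, ())
        ids = tuple(
            (0, int(p), "") if p.isdigit() else (1, 0, p)
            for p in self.prerelease.split(".")
        )
        return (self.major, self.minor, self.patch, 0, ids)


versions = ["2.0.0", "2.0.0-rc.1", "2.0.0-alpha.1", "2.0.0-beta.2", "1.1.0"]
print(sorted(versions, key=lambda v: Version.parse(v).precedence()))
```

Sorting by `precedence()` rather than by string is what keeps `2.0.0-rc.1` below `2.0.0` when a deployment tool picks the latest version.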

Automated Versioning with Conventional Commits

Using Conventional Commits (Standard 3), versioning can be automated:

# semantic-release: reads commit messages, determines version bump,
# creates git tag, generates CHANGELOG.md, publishes release

# .releaserc.yml
branches:
  - main
plugins:
  - "@semantic-release/commit-analyzer"          # reads conventional commits
  - "@semantic-release/release-notes-generator"
  - "@semantic-release/changelog"                # generates CHANGELOG.md
  - "@semantic-release/github"                   # creates GitHub Release
  - "@semantic-release/git"                      # commits version bump

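# .github/workflows/release.yml
- name: Semantic Release
  uses: cycjimmy/semantic-release-action@v4
  with:
    # changelog and git are not bundled with semantic-release and must be installed
    extra_plugins: |
      @semantic-release/changelog
      @semantic-release/git
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  # On main merge:
  # feat: → bump MINOR, tag v1.2.0
  # fix:  → bump PATCH, tag v1.1.1
  # feat!: or a "BREAKING CHANGE:" footer → bump MAJOR, tag v2.0.0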

Auto-generated CHANGELOG entry:

## [1.2.0] - 2026-05-18

### Features
- **orders:** add coupon validation at checkout (#142)
- **catalog:** enable full-text book search (#138)

### Bug Fixes
- **payments:** handle Stripe timeout without losing order state (#145)

### Performance
- **catalog:** reduce search response time by 40% with index optimisation (#140)

Standard 8: Observability — Logs, Metrics, and Traces

You cannot operate what you cannot observe. Observability is the ability to understand the internal state of a system from its external outputs. It has three pillars:

  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
  │    LOGS      │  │   METRICS    │  │    TRACES    │
  │              │  │              │  │              │
  │ WHAT happened│  │ HOW is it    │  │ WHY is it    │
  │ (events)     │  │ performing?  │  │ slow?        │
  │              │  │ (numbers)    │  │ (causality)  │
  │ ELK Stack    │  │ Prometheus   │  │ Jaeger       │
  │ Loki         │  │ + Grafana    │  │ Zipkin       │
  │ CloudWatch   │  │ Datadog      │  │ OpenTelemetry│
  └──────────────┘  └──────────────┘  └──────────────┘

Structured Logging Standard

Every log line must be a JSON object. Human-readable logs don't scale — you can't query, aggregate, or alert on plain text.

import structlog
import sys

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,     # correlation ID from context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        # structlog.stdlib.add_logger_name is omitted: it reads logger.name,
        # which the PrintLogger created by PrintLoggerFactory does not have
        structlog.processors.StackInfoRenderer(),
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(file=sys.stdout),
)

log = structlog.get_logger()

# Every log has: timestamp, level, service, correlation_id + business context
log.info("payment_processed",
    service="order-service",
    order_id="ord-abc123",
    customer_id="cust-xyz789",
    amount="99.99",
    currency="USD",
    gateway="stripe",
    duration_ms=234,
)

{"timestamp": "2026-05-18T10:23:41Z", "level": "info", "service": "order-service",
 "event": "payment_processed", "order_id": "ord-abc123", "customer_id": "cust-xyz789",
 "amount": "99.99", "currency": "USD", "gateway": "stripe", "duration_ms": 234,
 "correlation_id": "req-def456"}

Correlation IDs — Tracing Across Services

import uuid
from fastapi import FastAPI, Request
import structlog

app = FastAPI()

@app.middleware("http")
async def correlation_id_middleware(request: Request, call_next):
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())

    # Start from a clean context so nothing leaks in from a previous request,
    # even if that request ended with an exception
    structlog.contextvars.clear_contextvars()

    # Bind to structlog context: ALL log lines in this request get this ID
    structlog.contextvars.bind_contextvars(
        correlation_id=correlation_id,
        request_path=request.url.path,
        request_method=request.method,
    )

    response = await call_next(request)
    response.headers["X-Correlation-ID"] = correlation_id
    return response

Provided each service also forwards the X-Correlation-ID header on its outbound calls, every service in your system logs the same correlation_id. When debugging a request that touched 5 services, one query finds them all:

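  Loki (LogQL, label selector is an example):
  {namespace="production"} | json | correlation_id="req-def456"

  Elasticsearch (query string):
  correlation_id:"req-def456"

  →  all logs from all services for this request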
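The query only works if the ID travels with every downstream call. The sketch below shows one way to carry it using the standard library's `contextvars`; `start_request` and `outbound_headers` are hypothetical helper names, and in a real service the returned headers would be passed to whatever HTTP client you use.

```python
import contextvars
import uuid
from typing import Optional

# Holds the ID of the request currently being handled (isolated per task/thread)
_correlation_id = contextvars.ContextVar("correlation_id", default="")


def start_request(incoming_headers: dict[str, str]) -> str:
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    cid = incoming_headers.get("X-Correlation-ID") or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outbound_headers(extra: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Headers to attach to every downstream HTTP call."""
    headers = dict(extra or {})
    cid = _correlation_id.get()
    if cid:
        headers["X-Correlation-ID"] = cid
    return headers


start_request({"X-Correlation-ID": "req-def456"})
print(outbound_headers({"Accept": "application/json"}))
```

Centralising this in one helper (or in a shared HTTP client wrapper) avoids relying on every call site to remember the header.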

Prometheus Metrics

import time

from fastapi import Request, Response   # app = FastAPI() is created as in the previous example
from prometheus_client import Counter, Histogram, Gauge

# Define metrics at module level
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_code"]
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1.0, 2.5, 5.0]
)
ACTIVE_ORDERS = Gauge(
    "orders_active_total",
    "Number of orders in processing state"
)
ORDER_PROCESSING_ERRORS = Counter(
    "order_processing_errors_total",
    "Order processing errors by type",
    ["error_type"]
)

# FastAPI middleware to instrument automatically
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()   # monotonic clock; wall-clock time can jump
    response = await call_next(request)
    duration = time.perf_counter() - start

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status_code=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)

    return response

# Expose metrics endpoint
from fastapi import Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
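
One caveat with using the raw URL path as the `endpoint` label: paths that embed identifiers (for example `/orders/12345`) create a new time series per ID. Below is a minimal sketch of one way to bound the label set, assuming IDs are numeric or UUID-shaped. `normalize_endpoint` is a hypothetical helper, and other ID formats would need their own patterns.

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
_NUMERIC = re.compile(r"^\d+$")


def normalize_endpoint(path):
    """Collapse ID-like path segments so the label set stays bounded."""
    segments = []
    for segment in path.split("/"):
        if _UUID.match(segment):
            segments.append(":uuid")
        elif _NUMERIC.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


if __name__ == "__main__":
    print(normalize_endpoint("/orders/12345/items"))
```

The middleware would then pass `normalize_endpoint(request.url.path)` as the label value. Where the framework exposes the matched route template, that is usually a more robust label than pattern matching.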

Grafana Dashboard Standards¶

Every service must have a dashboard covering the RED method:

  RED METHOD (for request-driven services):
  ✦ Rate     — requests per second
  ✦ Errors   — error rate (4xx, 5xx)
  ✦ Duration — latency (p50, p95, p99)

  USE METHOD (for resources such as CPU, memory, disk, network):
  ✦ Utilisation  — % time the resource is busy
  ✦ Saturation   — how much work is queued
  ✦ Errors       — error rate

# Grafana dashboard panels (minimum per service):
panels:
  - title: "Request Rate (req/s)"
    query: sum(rate(http_requests_total[5m]))

  - title: "Error Rate (%)"
    query: |
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) * 100

  - title: "Latency p99 (ms)"
    query: |
      histogram_quantile(0.99,
        sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000

  - title: "Active Orders"
    query: sum(orders_active_total)

  - title: "Error Types"
    query: sum by (error_type) (rate(order_processing_errors_total[5m]))

SLOs — Service Level Objectives¶

Observability without SLOs is noise without signal. SLOs define what "good" looks like for your service:

# SLO definitions per service
service: order-service
slos:
  - name: availability
    description: "Orders API is available"
    target: 99.9%             # about 8.8 hours of downtime/year
    measurement: |
      1 - (sum(rate(http_requests_total{status_code=~"5.."}[30d]))
           / sum(rate(http_requests_total[30d])))

  - name: latency
    description: "Order placement completes in < 2 seconds"
    target: 95%               # 95% of requests under 2s
    measurement: |
      histogram_quantile(0.95,
        sum by (le) (rate(http_request_duration_seconds_bucket[30d]))) < 2

  - name: correctness
    description: "Orders are processed without errors"
    target: 99.5%
    measurement: |
      1 - (sum(rate(order_processing_errors_total[30d]))
           / sum(rate(orders_placed_total[30d])))

Error Budget = 1 - SLO target

If your SLO is 99.9%, your error budget is 0.1%: 43.2 minutes of full outage in a 30-day window (43.8 minutes in an average calendar month). When the budget is consumed, feature work stops and reliability work takes priority.
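
The arithmetic behind the error budget can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and `budget_remaining` assumes a request-based SLO in which every failed request spends budget equally.

```python
def error_budget(slo_target):
    """Fraction of requests (or time) allowed to fail under the SLO."""
    return 1.0 - slo_target


def downtime_allowance_minutes(slo_target, window_days):
    """Error budget expressed as minutes of full outage in the window."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_requests, failed_requests):
    """Share of the budget still unspent; negative means the SLO is breached."""
    allowed_failures = error_budget(slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(round(downtime_allowance_minutes(0.999, 30), 1))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A target of exactly 100% leaves no budget at all, and `budget_remaining` would divide by zero. That is one practical reason SLO targets are set below 100%.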

Standard 9: Kubernetes Deployment Standards¶

All services deploy to Kubernetes. These are the non-negotiable standards for every Deployment manifest.

# k8s/deployment.yaml — complete standard template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
  labels:
    app: order-service
    version: "1.2.0"
    team: platform
spec:
  replicas: 3                   # minimum 3 for HA in production
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # allow 1 extra pod during update
      maxUnavailable: 0         # never reduce below desired count
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
        version: "1.2.0"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      # Security context — run as non-root
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001

      # Terminate gracefully
      terminationGracePeriodSeconds: 30

      containers:
      - name: order-service
        image: ghcr.io/mycompany/order-service:sha-abc1234  # exact SHA, not 'latest'
        ports:
        - containerPort: 8000

        # Resource limits — ALWAYS set these
        resources:
          requests:             # guaranteed allocation
            memory: "256Mi"
            cpu: "100m"
          limits:               # maximum allocation
            memory: "512Mi"
            cpu: "500m"

        # Environment from ConfigMap and Secrets
        envFrom:
        - configMapRef:
            name: order-service-config
        - secretRef:
            name: order-service-secrets
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace

        # Health probes — all three, always
        startupProbe:
          httpGet:
            path: /health/live
            port: 8000
          failureThreshold: 30
          periodSeconds: 10

        livenessProbe:
          httpGet:
            path: /health/live
            port: 8000
          initialDelaySeconds: 0
          periodSeconds: 10
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
          periodSeconds: 5
          failureThreshold: 2

        # Graceful shutdown
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]

        # Read-only filesystem where possible
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]

        volumeMounts:
        - name: tmp
          mountPath: /tmp       # writable tmp if needed

      volumes:
      - name: tmp
        emptyDir: {}

      # Pod anti-affinity — spread across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - order-service
              topologyKey: kubernetes.io/hostname
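
One way to keep the template above from drifting is to check manifests in CI. Below is a minimal sketch, assuming the manifest has already been parsed into plain dictionaries (for example by a YAML loader). `check_container` and its messages are hypothetical and not part of any Kubernetes tooling. It covers only a few of the rules in the template.

```python
def check_container(container):
    """Return a list of violations for one container spec (a plain dict)."""
    problems = []

    # The tag is whatever follows ":" in the last path segment of the image.
    image = container.get("image", "")
    last_segment = image.rsplit("/", 1)[-1]
    tag = image.rsplit(":", 1)[1] if ":" in last_segment else ""
    if tag in ("", "latest"):
        problems.append("image must be pinned to an immutable tag")

    resources = container.get("resources", {})
    for section in ("requests", "limits"):
        for resource in ("cpu", "memory"):
            if resource not in resources.get(section, {}):
                problems.append(f"resources.{section}.{resource} is not set")

    for probe in ("startupProbe", "livenessProbe", "readinessProbe"):
        if probe not in container:
            problems.append(f"{probe} is missing")

    security = container.get("securityContext", {})
    if security.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")

    return problems


if __name__ == "__main__":
    for problem in check_container({"image": "nginx:latest"}):
        print(problem)
```

A CI job could fail the build whenever the returned list is non-empty. Dedicated policy engines do the same job with more coverage.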

# k8s/hpa.yaml — autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
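
To reason about how the autoscaler will react to load, it helps to see the core scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) of 1. The sketch below is a simplification that ignores stabilization windows and unready pods. `desired_replicas` is a hypothetical name, and the defaults mirror the manifest above.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=3, max_replicas=20, tolerance=0.1):
    """Simplified HPA rule for a single metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 3 pods at 140% of requested CPU against a 70% target
    print(desired_replicas(3, 140, 70))
```

With several metrics configured, as in the manifest above, the autoscaler evaluates each one and uses the largest resulting replica count.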

Standard 10: GitOps — Infrastructure as Code¶

Every infrastructure change must go through git. No manual kubectl apply. No console clicks in production.

  GITOPS WORKFLOW (with Argo CD):

  Developer                Git Repo              Argo CD          Kubernetes
     │                        │                     │                 │
     ├─ git push ────────────>│                     │                 │
     │                        ├─ PR / Review        │                 │
     │                        ├─ Merge to main ────>│                 │
     │                        │                     ├─ Detect diff    │
     │                        │                     ├─ kubectl apply >│
     │                        │                     │                 │
     │<─── Sync status ───────────────────────────────────────────────┤

# argocd/order-service-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: order-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/mycompany/k8s-manifests
    targetRevision: main
    path: services/order-service/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true         # remove resources deleted from git
      selfHeal: true      # revert manual changes in cluster
    syncOptions:
    - CreateNamespace=true

GitOps rules:

- The git repository is the single source of truth for cluster state
- Manual changes to the cluster are automatically reverted
- All changes are auditable via git history
- Rollback = git revert: simple, fast, safe
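
The reconciliation idea behind these rules can be sketched as a comparison between what git declares and what the cluster runs. This is a conceptual illustration, not Argo CD's implementation. `plan_sync` and the resource keys are hypothetical, and both inputs are assumed to be dictionaries mapping a resource key to its spec.

```python
def plan_sync(desired, live, prune=True):
    """Compare manifests from git (desired) with cluster state (live).

    Returns the actions a reconciler would take, in a stable order.
    """
    actions = []
    for key in sorted(desired):
        if key not in live:
            actions.append(("create", key))
        elif live[key] != desired[key]:
            # Covers manual drift: the cluster is moved back to the git state.
            actions.append(("update", key))
    if prune:
        # Resources that exist in the cluster but were deleted from git.
        for key in sorted(set(live) - set(desired)):
            actions.append(("delete", key))
    return actions


if __name__ == "__main__":
    desired = {"Deployment/order-service": {"replicas": 3}}
    live = {"Deployment/order-service": {"replicas": 5}, "ConfigMap/old": {}}
    print(plan_sync(desired, live))
```

In this sketch the `update` action corresponds to `selfHeal: true` and the `delete` action to `prune: true` in the Application manifest above.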

Standard 11: Incident Management¶

Incidents happen. The standard is not "never have incidents" but "respond effectively, recover quickly, and learn systematically."

Severity Levels¶

  ┌──────────┬───────────────────────────────────┬──────────────────────────┐
  │ Severity │ Definition                        │ Response Time            │
  ├──────────┼───────────────────────────────────┼──────────────────────────┤
  │ SEV-1    │ Production down, all users        │ Immediate: page on-call, │
  │ (P0)     │ impacted, revenue loss            │ all hands                │
  ├──────────┼───────────────────────────────────┼──────────────────────────┤
  │ SEV-2    │ Major feature broken, significant │ < 15 minutes             │
  │ (P1)     │ user impact, workaround available │ page on-call             │
  ├──────────┼───────────────────────────────────┼──────────────────────────┤
  │ SEV-3    │ Minor feature degraded, some      │ < 2 hours                │
  │ (P2)     │ users affected                    │ notify team              │
  ├──────────┼───────────────────────────────────┼──────────────────────────┤
  │ SEV-4    │ Minor issue, no user impact       │ Next business day        │
  │ (P3)     │                                   │ create ticket            │
  └──────────┴───────────────────────────────────┴──────────────────────────┘

Runbooks — Operational Playbooks¶

Every service must have a runbook for its common failure modes:

# Runbook: Order Service — High Error Rate

## When to use this runbook
Alert: order error rate above 5% for 5 minutes
(`sum(rate(order_processing_errors_total[5m])) / sum(rate(orders_placed_total[5m])) > 0.05`)

## Diagnosis steps

1. Check Grafana dashboard: https://grafana.internal/d/order-service
   - Is the error rate uniform across all pods? → infrastructure issue
   - Is it one pod? → restart that pod first

2. Check error logs:
   ```
   # --tail=-1: with a label selector, kubectl returns only 10 lines per pod by default
   kubectl logs -l app=order-service --namespace=production --since=10m --tail=-1 \
     | jq -c 'select(.level == "error")' | tail -50
   ```

3. Check dependent services:
   - Payments service: https://grafana.internal/d/payments
   - Inventory service: https://grafana.internal/d/inventory
   - Database: https://grafana.internal/d/postgres

## Common causes and fixes

### Database connection pool exhausted
Symptom: `connection pool timeout` in logs
Fix: `kubectl scale deployment order-service --namespace=production --replicas=5`
     (more pods = more connections; increase pool_size if recurring)

### Stripe API rate limited
Symptom: `stripe.error.RateLimitError` in logs
Fix: Payment retries are automatic. Monitor for 5 minutes.
     If persistent: check Stripe status page.

### Memory leak causing OOM
Symptom: pods restarting, `OOMKilled` in events
Fix: `kubectl rollout restart deployment/order-service --namespace=production`
     Create P1 ticket for root cause investigation.

## Escalation
- Tier 1 (15 min): On-call engineer
- Tier 2 (30 min): Service owner (@order-team in Slack)
- Tier 3 (45 min): Engineering Director

Blameless Post-Mortems¶

After every SEV-1 and SEV-2 incident, a blameless post-mortem must be completed within 5 business days:

# Post-Mortem: Order Service Outage — 2026-05-15

**Duration:** 47 minutes (10:14 – 11:01 UTC)
**Severity:** SEV-1
**Impact:** 100% of order placements failed; ~2,400 orders affected
**Author:** [Engineer Name]
**Reviewed by:** [Team Lead]

## Timeline
- 10:14 — Automated alert: order error rate > 20%
- 10:17 — On-call engineer paged, begins investigation
- 10:25 — Root cause identified: DB connection pool exhausted
- 10:31 — Mitigation applied: connection pool size increased
- 10:45 — Error rate returned to baseline
- 11:01 — Incident declared resolved

## Root Cause
A database migration deployed at 09:50 did not create an index on the
`orders.customer_id` column. Without the index, a full table scan was
triggered on every order lookup. At peak traffic (10:10–10:14), this
exhausted the 20-connection pool. New requests could not acquire
connections and returned 500 errors.

## What went well
✓ Alert fired within 30 seconds of impact
✓ On-call engineer engaged quickly (3 minutes after the alert)
✓ Root cause identified in < 10 minutes via structured logs
✓ Team communicated clearly in the #incidents channel

## What went wrong
✗ Migration did not include index creation
✗ No performance test gate validates query performance before deploy
✗ Connection pool size was never revisited after DB was scaled up

## Action items
| Action                                      | Owner     | Due        |
|---------------------------------------------|-----------|------------|
| Add index to orders.customer_id             | @dev-team | 2026-05-16 |
| Add query performance check to CI pipeline  | @platform | 2026-05-18 |
| Review and update connection pool sizes     | @infra    | 2026-05-18 |
| Create runbook for DB connection exhaustion | @on-call  | 2026-05-19 |

## Lessons learned
- Database migrations must be reviewed for index coverage
- Performance regression tests should run against staging pre-deploy
- Connection pool size should be proportional to pod count

Blameless means blameless

A post-mortem that names individuals as root causes ("engineer X forgot to add the index") is not blameless and is not useful. Systems cause incidents, not people. The question is always: what system change would have prevented this?

Standard 12: Code Review Standards¶

Code review is not about finding bugs (tests do that) — it's about knowledge sharing, consistency, and design feedback.

The Author's Responsibilities¶

  BEFORE requesting review:
  ✓ Self-review your own diff first
  ✓ PR description explains WHY (not just what — git diff shows that)
  ✓ All CI checks pass — never request review on a failing PR
  ✓ PR is small enough to review in < 30 minutes
  ✓ Tests are included and pass
  ✓ Linked to the Jira/Linear ticket

  PR description template:
  ## What
  [1-2 sentences on what this change does]

  ## Why
  [The problem being solved — link to ticket]

  ## How to test
  [Steps to verify the change works]

  ## Screenshots / Evidence
  [Before/after for UI; log output for backend]

The Reviewer's Responsibilities¶

  REVIEW within:
  ✓ 4 hours for unblocking changes (SEV-2+)
  ✓ 24 hours for standard changes

  REVIEW for:
  ✓ Correctness: does it do what it says?
  ✓ Tests: are the right scenarios covered?
  ✓ Design: is this the right abstraction?
  ✓ Security: any obvious vulnerabilities?
  ✓ Observability: are errors logged and metrics emitted?

  NOT responsible for:
  ✗ Style (automated by linting)
  ✗ Formatting (automated by formatters)
  ✗ Finding all bugs (tests do that)

PR Size Standards¶

  IDEAL PR SIZE: < 400 lines changed

  > 400 lines:   Harder to review, misses more issues
  > 800 lines:   Review quality degrades significantly
  > 1000 lines:  "LGTM" reviews — nobody is reading this

  If your PR is large, split it:
  ✦ Refactor in one PR (no behaviour change)
  ✦ Feature in a second PR (behaviour added)

  A PR that says "refactor + feature" in the title
  should be two PRs.
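
These thresholds are easy to check automatically. Below is a minimal sketch that sums the output of `git diff --numstat` and maps the total onto the bands above. The function names and band labels are hypothetical, and treating exactly 400 lines as ideal is an assumption, since the bands above leave that boundary open.

```python
def lines_changed(numstat_output):
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":   # binary files report "-"
            continue
        total += int(added) + int(deleted)
    return total


def classify_pr(total):
    """Map a line count onto the size bands described above."""
    if total > 1000:
        return "too large: split before review"
    if total > 800:
        return "large: review quality degrades"
    if total > 400:
        return "over ideal size"
    return "ideal"


if __name__ == "__main__":
    sample = "120\t30\tapp/orders.py\n-\t-\tdocs/diagram.png\n300\t10\ttests/test_orders.py\n"
    print(classify_pr(lines_changed(sample)))
```

Run as a CI step, a check like this can post a warning on the PR rather than block it, which leaves room for justified exceptions such as generated code.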

Standard 13: Collaboration and Knowledge Management¶

Communication Standards¶

  SYNCHRONOUS (real-time):
  ✦ Slack / Teams — for time-sensitive, conversational
  ✦ Video calls — for decisions, complex discussions
  ✦ Do NOT use email for engineering team communication

  ASYNCHRONOUS (not real-time):
  ✦ Jira / Linear — for task tracking and sprint management
  ✦ Confluence / Notion — for documentation and decisions
  ✦ GitHub/GitLab — for code reviews and technical discussion
  ✦ Post-mortems — for incident learnings

  CHANNEL NAMING CONVENTION (Slack):
  #team-[team-name]        — team's primary channel
  #service-[service-name]  — alerts and deployments for a service
  #incident-[date]-[name]  — per-incident channels (auto-archived)
  #deploy-[env]            — deployment notifications
  #alerts-[severity]       — monitoring alerts by severity

Architecture Decision Records (ADRs)¶

Every significant technical decision must be documented as an ADR — a short, dated record that captures the decision, the context, and the rationale:

# ADR-0023: Use PostgreSQL for the Orders Service Database

Date: 2026-05-18
Status: Accepted
Deciders: [Team leads]

## Context
The Orders service needs a persistent store for order data.
Requirements: ACID transactions, complex queries, JSON support,
managed service availability on AWS.

## Decision
Use PostgreSQL (via AWS RDS) as the primary database.

## Alternatives considered
- MySQL: Less powerful JSON support, no native arrays
- MongoDB: Multi-document ACID transactions (available since 4.0) add overhead; weaker fit for relational queries; operational complexity
- DynamoDB: Poor fit for complex relational queries; high cost at scale

## Consequences
+ Full ACID compliance for order state transitions
+ Rich query capabilities for reporting
+ Mature ecosystem, team familiarity
- Vertical scaling limits (mitigated by read replicas)
- Schema migrations require careful planning

Store ADRs in the repository at docs/adr/ — they are versioned alongside the code they describe.

The Complete Engineering Standards Checklist¶

Use this as a new service launch checklist and quarterly audit:

Summary — The Thirteen Standards¶

#	Standard	Core Tool	What It Prevents
1	12-Factor App	Config via env vars, stdout logs	Environment-specific builds, log sprawl
2	Trunk-Based Dev	Feature toggles, short branches	Merge conflicts, integration hell
3	Pre-Commit Hooks	pre-commit, detect-secrets	Bad code and secrets reaching CI
4	CI/CD Pipeline	GitHub Actions, Argo CD	Manual, error-prone deployments
5	Testing Pyramid	pytest, Pact	Bugs in production, slow feedback
6	DevSecOps	Semgrep, Trivy, OWASP ZAP	Security vulnerabilities reaching users
7	Semantic Versioning	semantic-release	Breaking changes without warning
8	Observability	Prometheus, Loki, Jaeger	Blind operations, slow incident response
9	Kubernetes Standards	Resource limits, probes, HPA	Outages, poor scaling, security gaps
10	GitOps	Argo CD	Manual drift, unaudited changes
11	Incident Management	Runbooks, post-mortems	Repeated incidents, slow recovery
12	Code Review	PR templates, size limits	Knowledge silos, poor design
13	Collaboration	ADRs, Slack standards	Decision loss, tribal knowledge

Standards compound. A team that applies all thirteen doesn't just add the benefits — it multiplies them. Reliable CI makes security scanning trustworthy. GitOps makes observability meaningful. Blameless post-mortems make runbooks better. The whole is greater than the sum of its parts.

Essential reading for going deeper: The Phoenix Project and The DevOps Handbook by Gene Kim et al., Accelerate by Nicole Forsgren et al., Site Reliability Engineering by Google, and Team Topologies by Skelton & Pais.
