Engineering Standards for DevOps: The Complete Guide¶
A team without standards is a team that reinvents everything — every time, in every project. Standards are not bureaucracy. They are the codified answers to questions your team has already solved, so you can spend your energy solving new ones.
This guide covers thirteen engineering standards that separate high-performing DevOps teams from the rest. For each standard, we go beyond the "what" to explain the "why" and the "how" — with concrete tooling, configuration examples, and the decision frameworks you need to implement them in your organisation.
Why Engineering Standards Matter¶
The DORA (DevOps Research and Assessment) research programme has tracked software delivery performance across thousands of organisations for over a decade. Their finding is clear: the practices in this guide directly predict business outcomes.
ELITE vs LOW PERFORMERS (DORA 2023 State of DevOps Report)
┌─────────────────────────────┬───────────────┬──────────────┐
│ Metric │ Elite │ Low │
├─────────────────────────────┼───────────────┼──────────────┤
│ Deployment Frequency │ Multiple/day │ < 1/month │
│ Lead Time for Changes │ < 1 hour │ > 6 months │
│ Change Failure Rate │ 0–5% │ 16–30% │
│ Time to Restore Service │ < 1 hour │ > 6 months │
└─────────────────────────────┴───────────────┴──────────────┘
Elite performers are:
→ 208× more frequent deployers
→ 106× faster lead time from commit to deploy
→ 2,604× faster at recovering from incidents
→ 7× lower change failure rate
→ 1.8× more likely to meet or exceed commercial targets
The gap is not talent. It's practices. The thirteen standards below are the practices that make the difference.
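To make the four metrics concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record and its field names are assumptions made for illustration, not a schema defined by DORA, and the function assumes at least one deployment in the period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # first commit in the change
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarise the four key metrics over one reporting period."""
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used rather than means so that one unusually slow change does not dominate the picture.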
The CALMS Framework — Culture First¶
Before any tool or process, DevOps requires a cultural shift. The CALMS framework describes the five pillars:
C — Culture Shared responsibility. Developers own operations.
"You build it, you run it." — Werner Vogels, Amazon CTO
A — Automation Automate everything that is repeatable.
If you do it twice, automate it.
L — Lean Eliminate waste. Small batches. Fast feedback.
Stop starting, start finishing.
M — Measurement You can't improve what you don't measure.
Instrument everything. Decide with data.
S — Sharing Share knowledge, tools, and post-mortems.
Blameless culture. Shared ownership.
Every standard in this guide serves one or more of these pillars. When a practice feels bureaucratic, check which CALMS pillar it serves — if it serves none, it shouldn't be a standard.
Standard 1: The Twelve-Factor Application¶
"Build software-as-a-service apps that use declarative formats for setup automation, have a clean contract with the underlying operating system, are suitable for deployment on modern cloud platforms, minimise divergence between development and production, and can scale up without significant changes to tooling, architecture, or development practices."
The 12-Factor App methodology is the baseline contract between your application code and the platform it runs on. Every service in your organisation should comply.
The three factors most frequently violated — and most impactful to fix:
Factor III: Config in the Environment¶
# BAD: config hardcoded or in committed files
DATABASE_URL = "postgresql://admin:secret@prod-db:5432/orders"
STRIPE_KEY = "sk_live_abc123"

# GOOD: all config from environment variables
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # .env is only read because env_file is set; keep it for local use, never commit it
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str        # read from DATABASE_URL
    stripe_secret_key: str   # read from STRIPE_SECRET_KEY
    log_level: str = "INFO"
    debug: bool = False

settings = Settings()  # real env vars take precedence over values in .env
Factor XI: Logs as Event Streams¶
# BAD: writing to log files
import logging
logging.basicConfig(filename='/var/log/app.log', level=logging.INFO)

# GOOD: write to stdout, structured as JSON
import structlog, sys
structlog.configure(
    processors=[
        # utc=True produces the trailing "Z" shown in the output below
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(file=sys.stdout),
)
log = structlog.get_logger()
log.info("order_placed", order_id="abc123", amount="99.99", currency="USD")
Output (queryable, parseable, monitorable):
{"timestamp": "2026-05-18T10:23:41Z", "level": "info", "event": "order_placed",
"order_id": "abc123", "amount": "99.99", "currency": "USD"}
Factor VI: Stateless Processes¶
No state in memory between requests. Sessions in Redis. Files in object storage (S3/GCS). This is what enables horizontal scaling — any instance can serve any request.
Quick compliance check
Can you kill any running instance of your service right now without losing data or dropping in-flight requests? If yes, you're stateless. If no, find the state and externalise it.
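The compliance check can be sketched in a few lines. In this illustration `InMemoryStore` stands in for an external store such as Redis, and `CartHandler` plays the role of one service instance; both names are hypothetical. Because the handler keeps no state of its own, a second instance picks up exactly where the first left off.

```python
from typing import Protocol


class SessionStore(Protocol):
    def get(self, session_id: str) -> dict: ...
    def put(self, session_id: str, data: dict) -> None: ...


class InMemoryStore:
    """Stand-in for an external store (for example Redis) shared by all instances."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        return dict(self._data.get(session_id, {}))

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)


class CartHandler:
    """One service instance. It holds a reference to the store and nothing else."""

    def __init__(self, store: SessionStore) -> None:
        self.store = store

    def add_item(self, session_id: str, item: str) -> list[str]:
        session = self.store.get(session_id)
        items = session.get("items", []) + [item]
        self.store.put(session_id, {"items": items})
        return items


store = InMemoryStore()
instance_a, instance_b = CartHandler(store), CartHandler(store)
instance_a.add_item("s1", "book-1")
del instance_a                              # "kill" the first instance
print(instance_b.add_item("s1", "book-2"))  # prints ['book-1', 'book-2']
```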
Standard 2: Source Code Management — Trunk-Based Development¶
The Three Branching Strategies Compared¶
GITFLOW (traditional): TRUNK-BASED DEVELOPMENT:
main ────────────────────────── main ─────────────────────
│ │ │ │
└── develop ───────────────── feat feat feat
│ │ (< 2 days)
└── feature └── release merge → main daily
(weeks) (weeks)
GitHub Flow (middle ground):
main ─────────────────────────
│ │ │ │
PR PR PR PR (short-lived, reviewed, merged)
The DORA research conclusion: Trunk-Based Development — working directly on main or in very short-lived branches (< 2 days) — is one of the strongest predictors of high software delivery performance.
Trunk-Based Development Implementation¶
RULES FOR TRUNK-BASED DEVELOPMENT
✓ All developers commit to main at least once per day
✓ Feature branches live < 2 days before merging
✓ Feature toggles hide incomplete features from users
✓ The main branch is ALWAYS deployable
✓ Build must pass before merge — no broken main, ever
✓ Small commits — each commit is a coherent, reviewable unit
Feature Toggles¶
Feature toggles let you merge incomplete features to main without exposing them to users:
# Feature toggle implementation (feature_flags.py)
import os

class FeatureFlags:
    @staticmethod
    def new_checkout_flow() -> bool:
        return os.getenv("FF_NEW_CHECKOUT_FLOW", "false").lower() == "true"

    @staticmethod
    def ai_product_recommendations() -> bool:
        return os.getenv("FF_AI_RECOMMENDATIONS", "false").lower() == "true"

# In application code
from feature_flags import FeatureFlags

def get_checkout_view(request):
    if FeatureFlags.new_checkout_flow():
        return new_checkout_view(request)  # behind flag: safe to merge to main
    return legacy_checkout_view(request)   # default path
# Feature flag config per environment
# development/.env
FF_NEW_CHECKOUT_FLOW=true
FF_AI_RECOMMENDATIONS=true
# staging/.env
FF_NEW_CHECKOUT_FLOW=true
FF_AI_RECOMMENDATIONS=false
# production/.env
FF_NEW_CHECKOUT_FLOW=false # flip to true when ready to release
FF_AI_RECOMMENDATIONS=false
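A boolean flag is all-or-nothing. If a gradual release is wanted, one common extension is a percentage rollout. The sketch below is an assumption layered on top of the flags above: the `FF_<NAME>_PERCENT` variable is a hypothetical convention, and hashing the user ID is just one way to keep each user's experience stable between requests.

```python
import hashlib
import os


def rollout_percentage(flag_name: str) -> int:
    """Read e.g. FF_NEW_CHECKOUT_FLOW_PERCENT=25; missing or invalid values mean 0."""
    raw = os.getenv(f"FF_{flag_name}_PERCENT", "0")
    try:
        return max(0, min(100, int(raw)))
    except ValueError:
        return 0


def is_enabled_for(flag_name: str, user_id: str) -> bool:
    """Place each user in a stable bucket 0-99 and compare it with the rollout."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percentage(flag_name)
```

Raising the percentage only ever adds users to the enabled group, because a user's bucket never changes for a given flag name.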
Standard 3: Git Pre-Commit Hooks¶
Pre-commit hooks run automated checks before a commit is accepted. They catch issues at the cheapest possible point — before the CI pipeline, before code review, before the merge.
Setting Up Pre-Commit¶
# Install pre-commit
pip install pre-commit
# Create .pre-commit-config.yaml at repo root
cat > .pre-commit-config.yaml << 'EOF'
repos:
  # Code formatting
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
        language_version: python3.12
  # Import sorting
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
        args: ["--profile", "black"]
  # Linting
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff
        args: ["--fix"]
  # Type checking
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy
  # Secret scanning
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
  # General file hygiene
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: check-added-large-files
        args: ['--maxkb=500']
      - id: no-commit-to-branch
        args: ['--branch', 'main', '--branch', 'master']
EOF
# Install the hooks (runs once per developer)
pre-commit install
pre-commit install --hook-type commit-msg
Commit Message Standard — Conventional Commits¶
Enforce a consistent, machine-readable commit format that enables automatic changelog generation and semantic versioning:
FORMAT: <type>(<scope>): <description>
TYPES:
feat → New feature for the user (triggers MINOR version bump)
fix → Bug fix for the user (triggers PATCH version bump)
docs → Documentation only
style → Formatting, missing semicolons (no logic change)
refactor → Code restructure (no feature or bug)
test → Adding or fixing tests
chore → Build process, dependency updates
perf → Performance improvement
ci → CI/CD configuration
! after the type/scope, or a "BREAKING CHANGE:" footer → Incompatible API change, valid on any type (triggers MAJOR version bump)
EXAMPLES:
feat(orders): add coupon code validation at checkout
fix(payments): handle Stripe timeout without losing order state
refactor(catalog): extract search logic into SearchService
chore(deps): bump sqlalchemy from 2.0.29 to 2.0.30
feat(api)!: rename /orders endpoint to /v2/orders
# commitlint .commitlintrc.yaml — enforce conventional commits in CI
extends:
- '@commitlint/config-conventional'
rules:
body-max-line-length: [2, always, 100]
subject-min-length: [2, always, 20] # minimum meaningful message
references-empty: [1, never] # warn if no Jira ticket reference
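Because the format is machine-readable, the version bump can be derived from a commit message with a few lines of code. The sketch below is a simplified reading of the Conventional Commits format, in which a breaking change is marked with `!` before the colon or with a `BREAKING CHANGE:` footer. The `version_bump` helper is hypothetical; real tooling such as commitlint and semantic-release handles many more cases.

```python
import re

# Header shape: type(scope)!: description, where scope and "!" are optional
HEADER = re.compile(
    r"^(?P<type>feat|fix|docs|style|refactor|test|chore|perf|ci)"
    r"(\((?P<scope>[\w-]+)\))?"
    r"(?P<breaking>!)?: (?P<description>.+)$"
)


def version_bump(message: str) -> str:
    """Return 'major', 'minor', 'patch' or 'none'; raise ValueError if malformed."""
    header, _, body = message.partition("\n")
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"not a conventional commit: {header!r}")
    if match["breaking"] or re.search(r"^BREAKING CHANGE: ", body, re.MULTILINE):
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match["type"], "none")


print(version_bump("feat(orders): add coupon code validation at checkout"))  # minor
print(version_bump("feat(api)!: rename /orders endpoint to /v2/orders"))     # major
```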
Standard 4: CI/CD Pipeline¶
Every commit to main should trigger a pipeline that verifies, tests, and deploys the change. The pipeline is your quality gate — it must pass before code reaches production.
Pipeline Architecture¶
COMMIT → CI (verify) → CD-STAGING → CD-PRODUCTION
┌──────────────────────────────────────────────────────────────┐
│ CI PIPELINE (every commit) │
│ │
│ Checkout → Install deps → Lint/Format → Unit Tests → │
│ SAST Scan → Build Image → Push to Registry → [tag: sha] │
└──────────────────────────────────────────────────────────────┘
│
(on main branch only)
▼
┌──────────────────────────────────────────────────────────────┐
│ CD-STAGING (automatic on main merge) │
│ │
│ Deploy to Staging → Integration Tests → DAST Scan → │
│ Smoke Tests → Performance Tests → [approval gate] │
└──────────────────────────────────────────────────────────────┘
│
(manual approval or auto)
▼
┌──────────────────────────────────────────────────────────────┐
│ CD-PRODUCTION (controlled) │
│ │
│ Deploy (Canary 5%) → Monitor → Expand (25%) → │
│ Monitor → Full Rollout → Post-deploy Smoke Tests │
└──────────────────────────────────────────────────────────────┘
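The "Monitor" steps in the production stage need an explicit rule for when to expand and when to roll back. The following is a minimal sketch of such a gate. The thresholds, the `WindowStats` record, and the comparison against a baseline are illustrative assumptions, not the behaviour of any particular canary tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts observed for one deployment during the monitoring window."""
    requests: int
    errors: int

    @property
    def error_pct(self) -> float:
        return 100.0 * self.errors / self.requests if self.requests else 0.0


def canary_decision(canary: WindowStats, baseline: WindowStats,
                    error_threshold_pct: float = 1.0, min_requests: int = 100) -> str:
    """Return 'promote', 'rollback' or 'wait' for the next rollout step."""
    if canary.requests < min_requests:
        return "wait"        # too little traffic to judge either way
    if canary.error_pct > error_threshold_pct:
        return "rollback"    # absolute error ceiling breached
    if canary.error_pct > 2 * baseline.error_pct + 0.1:
        return "rollback"    # clearly worse than the stable version
    return "promote"
```

Checking against both an absolute ceiling and the stable version's error rate guards against promoting a canary that is "under 1%" yet several times worse than what is already running.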
Complete GitHub Actions Pipeline¶
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
  PYTHON_VERSION: "3.12"
jobs:
  # ── STAGE 1: Code Quality ─────────────────────────────────────
  quality:
    name: Code Quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip
      - name: Install dependencies
        run: pip install -r requirements-dev.txt
      - name: Lint (ruff)
        run: ruff check . --output-format=github
      - name: Format check (black)
        run: black --check .
      - name: Type check (mypy)
        run: mypy src/
      - name: Detect secrets
        uses: reviewdog/action-detect-secrets@v0.21.0
  # ── STAGE 2: Tests ────────────────────────────────────────────
  test:
    name: Tests
    runs-on: ubuntu-latest
    needs: quality
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_DB: test_db
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
        ports:
          - 5432:5432
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip
      - name: Install dependencies
        run: pip install -r requirements-dev.txt
      - name: Run unit tests with coverage
        run: |
          pytest tests/unit/ \
            --cov=src \
            --cov-report=xml \
            --cov-report=term-missing \
            --cov-fail-under=80 \
            -v
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379/0
      - name: Run integration tests
        run: pytest tests/integration/ -v
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test_db
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v4
        with:
          files: coverage.xml
          fail_ci_if_error: true
  # ── STAGE 3: Security Scan ────────────────────────────────────
  security:
    name: Security Scan
    runs-on: ubuntu-latest
    needs: quality
    steps:
      - uses: actions/checkout@v4
      - name: SAST (Semgrep)
        uses: semgrep/semgrep-action@v1
        with:
          config: >-
            p/python
            p/owasp-top-ten
            p/secrets
        env:
          SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_APP_TOKEN }}
      - name: Dependency scan (pip-audit)
        # pip-audit exits non-zero on ANY known vulnerability (it has no severity
        # filter), so one run both writes the report and gates the job
        run: |
          pip install pip-audit
          pip-audit --requirement requirements.txt \
            --format json \
            --output pip-audit-report.json
      - name: Filesystem scan (Trivy)
        # scan-type fs scans the checked-out repository, not a built image
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          scanners: vuln,secret   # replaces the deprecated security-checks input
          severity: HIGH,CRITICAL
          exit-code: 1
  # ── STAGE 4: Build & Push Image ──────────────────────────────
  build:
    name: Build Container Image
    runs-on: ubuntu-latest
    needs: [test, security]
    permissions:
      contents: read
      packages: write
    outputs:
      # tags is a newline-separated list when several tag rules match;
      # the digest identifies exactly one image and is safer to deploy by
      image_tag: ${{ steps.meta.outputs.tags }}
      image_digest: ${{ steps.push.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract Docker metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,format=long
            type=ref,event=branch
            type=semver,pattern={{version}}
      - name: Build and push
        id: push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
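  # ── STAGE 5: Deploy to Staging ────────────────────────────────
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: staging
    steps:
      # Assumes an earlier step or the runner provides cluster credentials (kubeconfig)
      - name: Deploy to Kubernetes (staging)
        # Deploy by digest: the tags output can hold several tags, the digest is unique.
        # Assumes the repository name is lowercase, as image references must be.
        run: |
          kubectl set image deployment/order-service \
            order-service=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ needs.build.outputs.image_digest }} \
            --namespace=staging
          kubectl rollout status deployment/order-service \
            --namespace=staging --timeout=5m
      - name: Run smoke tests against staging
        run: |
          ./scripts/smoke-test.sh ${{ vars.STAGING_URL }}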
  # ── STAGE 6: Deploy to Production ────────────────────────────
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    # build must be listed here: the needs context only exposes direct dependencies
    needs: [build, deploy-staging]
    environment: production # requires manual approval
    steps:
      - name: Deploy canary (5%)
        run: kubectl apply -f k8s/canary-5pct.yaml
      - name: Monitor canary (5 minutes)
        run: ./scripts/monitor-canary.sh --duration=300 --error-threshold=1
      - name: Promote to full rollout
        # Deploy by digest so the value is a single image reference
        run: |
          kubectl set image deployment/order-service \
            order-service=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ needs.build.outputs.image_digest }} \
            --namespace=production
          kubectl rollout status deployment/order-service \
            --namespace=production --timeout=10m
      - name: Post-deploy smoke tests
        run: ./scripts/smoke-test.sh ${{ vars.PRODUCTION_URL }}
Standard 5: The Testing Pyramid¶
Testing must be structured in a pyramid — many fast, cheap tests at the base; few slow, expensive tests at the top:
          ╱▔▔▔▔▔▔╲
         ╱  E2E   ╲          Few, slow, brittle
        ╱  Tests   ╲         5-10% of test suite
       ╱────────────╲
      ╱ Integration  ╲       Some, moderate speed
     ╱     Tests      ╲      20-30% of test suite
    ╱──────────────────╲
   ╱    Unit Tests      ╲    Many, fast, isolated
  ╱                      ╲   60-70% of test suite
 ╱────────────────────────╲
Unit Testing Standards¶
# Naming convention: test_<unit>_<condition>_<expected_result>
# Every test: Arrange → Act → Assert
from decimal import Decimal
from uuid import uuid4

import pytest

# Cart, CustomerId, BookId, Money, Coupon, EmptyCartError, OrderService and the
# fakes below come from the application's own packages (imports not shown)

def test_cart_apply_coupon_with_20pct_discount_reduces_total_correctly():
    # Arrange
    cart = Cart.create(customer_id=CustomerId(uuid4()))
    cart.add_book(
        book_id=BookId(uuid4()),
        title="Clean Code",
        price=Money(Decimal("100.00"), "USD"),
        quantity=1,
    )
    # Act
    cart.apply_coupon(Coupon(code="SAVE20", discount_percent=20))
    # Assert
    assert cart.total == Money(Decimal("80.00"), "USD")

def test_cart_checkout_with_empty_cart_raises_empty_cart_error():
    # Arrange
    cart = Cart.create(customer_id=CustomerId(uuid4()))
    # Act / Assert
    with pytest.raises(EmptyCartError):
        cart.checkout()

# Test isolation: no database, no network, no filesystem
# All external dependencies are replaced with fakes/stubs
class TestOrderService:
    def setup_method(self):
        self.order_repo = InMemoryOrderRepository()  # no real DB
        self.email = FakeEmailService()              # no real email
        self.service = OrderService(self.order_repo, self.email)
Coverage Standards¶
# pyproject.toml — enforce coverage thresholds
[tool.pytest.ini_options]
addopts = "--cov=src --cov-fail-under=80"
[tool.coverage.report]
exclude_lines = [
"pragma: no cover",
"if __name__ == .__main__.:",
"raise NotImplementedError",
"class .*\\bProtocol\\b",
"@(abc\\.)?abstractmethod",
]
80% coverage is a floor, not a goal
80% coverage is the minimum gate — it prevents obvious gaps. But 80% coverage with bad tests (tests that never assert anything meaningful) is worse than 60% coverage with good tests. Aim for meaningful coverage: test the business logic paths, not just line execution.
Contract Testing — Verifying Service Integration¶
Contract tests verify that a consumer's expectations of a provider's API are met, without a full integration environment:
# Using Pact for consumer-driven contract testing
# Consumer side: Order Service expects this from Catalog Service
from decimal import Decimal

import pytest

# pact_mock_server and catalog_client are assumed to be provided by the project
# (a started Pact mock provider and an HTTP client pointed at it); they are not shown here
@pytest.fixture
def pact(pact_mock_server):
    return pact_mock_server

def test_get_book_price_from_catalog(pact):
    # Define what Order Service expects
    (pact
        .given("Book abc123 exists in catalog")
        .upon_receiving("a request for book price")
        .with_request("GET", "/books/abc123/price")
        .will_respond_with(200, body={
            "book_id": "abc123",
            "price": "29.99",
            "currency": "USD",
            "available": True,
        }))
    with pact:
        result = catalog_client.get_book_price("abc123")
    assert result.price == Decimal("29.99")
    assert result.currency == "USD"

# This contract is published to a Pact Broker.
# The Catalog Service runs "provider verification" to prove it satisfies the contract.
# If Catalog changes the API in a breaking way, verification fails before production.
Standard 6: DevSecOps — Security Shifted Left¶
Security is not a phase at the end of development — it's a practice embedded at every stage. This is Shift Left Security: finding vulnerabilities earlier, when they are cheapest to fix.
SHIFT LEFT: vulnerability fix cost by stage
Design → Code → Build → Test → Staging → Production
$1 $6 $65 $960 $3,000 $100,000+
Source: IBM Systems Science Institute
SAST — Static Application Security Testing¶
SAST analyses source code for vulnerabilities without running the application:
# Semgrep: SAST in CI (.github/workflows/security.yml)
- name: SAST with Semgrep
  uses: semgrep/semgrep-action@v1
  with:
    # Rulesets, in order: Python-specific rules, OWASP Top 10 checks,
    # hardcoded secrets, SQL injection patterns, cross-site scripting.
    # Comments must stay outside the folded scalar; inside it they become part of the value.
    config: >-
      p/python
      p/owasp-top-ten
      p/secrets
      p/sql-injection
      p/xss
# SonarQube — code quality + security in one
sonar.projectKey=order-service
sonar.sources=src
sonar.tests=tests
sonar.python.coverage.reportPaths=coverage.xml
sonar.qualitygate.wait=true
# Quality Gate blocks deployment if:
# - Coverage < 80%
# - Duplicated code > 3%
# - Security Rating < A
# - Reliability Rating < A
# - New bugs introduced > 0
DAST — Dynamic Application Security Testing¶
DAST tests the running application by sending malicious inputs — like a real attacker would:
# OWASP ZAP in CI pipeline (against staging environment)
- name: DAST with OWASP ZAP
  uses: zaproxy/action-full-scan@v0.10.0
  with:
    target: ${{ vars.STAGING_URL }}
    rules_file_name: .zap/rules.tsv
    cmd_options: '-a -j'   # -a: include alpha scan rules, -j: also run the AJAX spider
    fail_action: true      # fail the workflow when ZAP reports alerts
.zap/rules.tsv — tune OWASP ZAP rules
10020 IGNORE # X-Frame-Options (handled by CDN)
10038 WARN # Content Security Policy (warn, not fail)
40012 FAIL # Reflected XSS — always fail
40018 FAIL # SQL Injection — always fail
90020 FAIL # Remote OS Command Injection — always fail
Dependency Scanning¶
# GitHub Dependabot: automatic dependency updates
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: pip
    directory: /
    schedule:
      interval: weekly
    open-pull-requests-limit: 10
    labels:
      - dependencies
      - security
    ignore:
      - dependency-name: "*"
        update-types: ["version-update:semver-major"] # manual for major bumps
  - package-ecosystem: docker
    directory: /
    schedule:
      interval: weekly
  - package-ecosystem: github-actions
    directory: /
    schedule:
      interval: weekly
# OWASP Dependency-Check: scan against the National Vulnerability Database
# --failOnCVSS 7 fails the scan on HIGH (CVSS 7+) vulnerabilities.
# A comment after a trailing backslash would break the command, so it lives up here.
docker run --rm \
  -v "$(pwd)":/src \
  -v "$(pwd)/odc-reports":/report \
  owasp/dependency-check:latest \
  --scan /src \
  --format HTML \
  --format JSON \
  --out /report \
  --failOnCVSS 7 \
  --enableRetired
Secret Scanning¶
Treat any secret committed to git as compromised: deleting it in a later commit leaves it in the history, so rotate it immediately.
# detect-secrets: prevent secrets before commit
# .pre-commit-config.yaml
- repo: https://github.com/Yelp/detect-secrets
  rev: v1.5.0
  hooks:
    - id: detect-secrets
      args: ['--baseline', '.secrets.baseline']
# Generate baseline (whitelist known false positives)
detect-secrets scan > .secrets.baseline
# Commit .secrets.baseline — it tracks known non-secrets
# CI fails if new secrets are found that aren't in baseline
# GitHub Advanced Security — secret scanning on push
# (enabled in repo settings)
# Alerts on: AWS keys, Stripe keys, GitHub tokens, 200+ pattern types
# Can block push if secret detected (push protection)
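One technique behind this kind of scanner is entropy analysis: randomly generated keys use their alphabet far more evenly than ordinary identifiers do. The sketch below illustrates the idea only. The regular expression, the 20-character minimum, and the 4.0-bit threshold are assumptions chosen for the example, not the rules any particular tool ships with.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character; random tokens score higher than words."""
    counts = Counter(text)
    return -sum((n / len(text)) * math.log2(n / len(text)) for n in counts.values())


# Quoted runs of 20 or more characters drawn from a typical key alphabet
QUOTED_TOKEN = re.compile(r"""["']([A-Za-z0-9_+/=-]{20,})["']""")


def find_suspect_strings(source: str, threshold: float = 4.0) -> list[str]:
    """Return quoted tokens whose entropy exceeds the threshold."""
    return [s for s in QUOTED_TOKEN.findall(source) if shannon_entropy(s) > threshold]
```

Entropy checks produce false positives (hashes, UUIDs, test fixtures), which is why a reviewed baseline file of known non-secrets is useful.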
Container Security¶
# Dockerfile security best practices
# Use a minimal, specific base image (not python:latest, not python:3.12).
# Dockerfile comments must be on their own line; a trailing "# ..." is parsed as arguments.
FROM python:3.12-slim-bookworm AS base
# Create a non-root user
RUN groupadd --gid 1001 appgroup && \
    useradd --uid 1001 --gid appgroup --shell /bin/bash --create-home appuser

# Multi-stage build: the production image has no build tools
FROM base AS builder
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

FROM base AS production
COPY --from=builder --chown=appuser:appgroup /root/.local /home/appuser/.local
COPY --chown=appuser:appgroup src/ /app/
# Put the copied console scripts (such as uvicorn) on PATH
ENV PATH=/home/appuser/.local/bin:$PATH
# Never run as root
USER appuser
WORKDIR /app
# Read-only filesystem where possible (enforced at runtime, not in the Dockerfile)
# Health check defined in the Dockerfile; slim images ship without curl, so use Python
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health/live', timeout=2)" || exit 1
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# Trivy: container vulnerability scanning
- name: Scan container image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
    format: sarif
    output: trivy-results.sarif
    severity: HIGH,CRITICAL
    exit-code: 1 # fail pipeline on HIGH/CRITICAL CVEs
Standard 7: Semantic Versioning¶
Every release must have a version number that communicates the nature of the change to consumers.
SEMANTIC VERSIONING FORMAT: MAJOR.MINOR.PATCH
MAJOR: Breaking change — consumers must update their code
MINOR: New feature, backward compatible — consumers can upgrade safely
PATCH: Bug fix, backward compatible — consumers should upgrade
Examples:
1.0.0 → 1.0.1 Bug fix (PATCH)
1.0.1 → 1.1.0 New feature added (MINOR)
1.1.0 → 2.0.0 API contract changed (MAJOR)
Pre-release:
2.0.0-alpha.1 → Internal testing
2.0.0-beta.2 → External beta testing
2.0.0-rc.1 → Release candidate
2.0.0 → Stable release
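The ordering and bump rules above are mechanical enough to express in code. This is a minimal sketch with a hypothetical `Version` class; it ignores build metadata (`+build`) and is no substitute for a maintained semver library.

```python
import re
from functools import total_ordering

SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


@total_ordering
class Version:
    def __init__(self, text: str) -> None:
        match = SEMVER.match(text)
        if match is None:
            raise ValueError(f"not a semantic version: {text!r}")
        self.core = tuple(int(part) for part in match.group(1, 2, 3))
        self.pre = match.group(4)  # e.g. "rc.1", or None for a stable release

    def _key(self):
        # A stable release sorts after every pre-release of the same core version.
        # Numeric identifiers compare as numbers and rank below alphanumeric ones.
        if self.pre is None:
            return (self.core, 1, ())
        ids = tuple((0, int(i), "") if i.isdigit() else (1, 0, i)
                    for i in self.pre.split("."))
        return (self.core, 0, ids)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

    def bump(self, part: str) -> str:
        """Return the next version string for a 'major', 'minor' or 'patch' change."""
        major, minor, patch = self.core
        return {"major": f"{major + 1}.0.0",
                "minor": f"{major}.{minor + 1}.0",
                "patch": f"{major}.{minor}.{patch + 1}"}[part]


print(Version("1.0.1").bump("minor"))                  # prints 1.1.0
print(Version("2.0.0-rc.1") < Version("2.0.0"))        # prints True
```

Comparing versions as plain strings would get this wrong: "1.10.0" sorts before "1.9.0" alphabetically, which is why the core is parsed into integers.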
Automated Versioning with Conventional Commits¶
Using Conventional Commits (Standard 3), versioning can be automated:
# semantic-release: reads commit messages, determines version bump,
# creates git tag, generates CHANGELOG.md, publishes release
# .releaserc.json (JSON allows no comments, so the plugin roles are listed here):
#   commit-analyzer          reads conventional commits
#   release-notes-generator  builds the release notes
#   changelog                generates CHANGELOG.md
#   github                   creates GitHub Release
#   git                      commits version bump
{
  "branches": ["main"],
  "plugins": [
    "@semantic-release/commit-analyzer",
    "@semantic-release/release-notes-generator",
    "@semantic-release/changelog",
    "@semantic-release/github",
    "@semantic-release/git"
  ]
}
# .github/workflows/release.yml
- name: Semantic Release
  uses: cycjimmy/semantic-release-action@v4
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# On main merge:
# feat: → bump MINOR, tag v1.2.0
# fix: → bump PATCH, tag v1.1.1
# BREAKING CHANGE → bump MAJOR, tag v2.0.0
Auto-generated CHANGELOG entry:
## [1.2.0] - 2026-05-18
### Features
- **orders:** add coupon validation at checkout (#142)
- **catalog:** enable full-text book search (#138)
### Bug Fixes
- **payments:** handle Stripe timeout without losing order state (#145)
### Performance
- **catalog:** reduce search response time by 40% with index optimisation (#140)
Standard 8: Observability — Logs, Metrics, and Traces¶
You cannot operate what you cannot observe. Observability is the ability to understand the internal state of a system from its external outputs. It has three pillars:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ LOGS │ │ METRICS │ │ TRACES │
│ │ │ │ │ │
│ WHAT happened│ │ HOW is it │ │ WHY is it │
│ (events) │ │ performing? │ │ slow? │
│ │ │ (numbers) │ │ (causality) │
│ ELK Stack │ │ Prometheus │ │ Jaeger │
│ Loki │ │ + Grafana │ │ Zipkin │
│ CloudWatch │ │ Datadog │ │ OTEL Collector│
└──────────────┘ └──────────────┘ └──────────────┘
Structured Logging Standard¶
Every log line must be a JSON object. Human-readable logs don't scale — you can't query, aggregate, or alert on plain text.
import structlog
import sys

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # correlation ID from context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        # structlog.stdlib.add_logger_name is left out on purpose: it reads logger.name,
        # which the PrintLogger created by PrintLoggerFactory does not have
        structlog.processors.StackInfoRenderer(),
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(file=sys.stdout),
)
log = structlog.get_logger()

# Every log has: timestamp, level, service, correlation_id + business context
log.info("payment_processed",
    service="order-service",
    order_id="ord-abc123",
    customer_id="cust-xyz789",
    amount="99.99",
    currency="USD",
    gateway="stripe",
    duration_ms=234,
)
{"timestamp": "2026-05-18T10:23:41Z", "level": "info", "service": "order-service",
"event": "payment_processed", "order_id": "ord-abc123", "customer_id": "cust-xyz789",
"amount": "99.99", "currency": "USD", "gateway": "stripe", "duration_ms": 234,
"correlation_id": "req-def456"}
Correlation IDs — Tracing Across Services¶
import uuid
from fastapi import FastAPI, Request
import structlog

app = FastAPI()

@app.middleware("http")
async def correlation_id_middleware(request: Request, call_next):
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    # Bind to structlog context: ALL log lines in this request get this ID
    structlog.contextvars.bind_contextvars(
        correlation_id=correlation_id,
        request_path=request.url.path,
        request_method=request.method,
    )
    try:
        response = await call_next(request)
    finally:
        # Unbind every key bound above so nothing leaks into a later request
        structlog.contextvars.unbind_contextvars(
            "correlation_id", "request_path", "request_method"
        )
    response.headers["X-Correlation-ID"] = correlation_id
    return response
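Every service that runs this middleware, and forwards the X-Correlation-ID header on its own outbound calls, logs the same correlation_id. When debugging a request that touched 5 services, one query finds them all:
Loki (LogQL; the namespace label is an assumption about your setup):
{namespace="production"} | json | correlation_id="req-def456"
Elasticsearch (query string):
correlation_id:"req-def456" → all logs from all services for this request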
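The ID only links services together if each one passes it on. Here is a minimal sketch of the outbound half, using only the standard library. The `correlation_id_var` variable and `outbound_headers` helper are hypothetical names; in a real service the middleware would set the variable and an HTTP client would send the headers.

```python
import contextvars
import uuid
from typing import Optional

# Holds the ID of the request currently being handled (hypothetical module-level variable)
correlation_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default=""
)


def outbound_headers(extra: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Headers for a downstream call, carrying the current correlation ID."""
    # Fall back to a fresh ID for work that did not start from an HTTP request
    correlation_id = correlation_id_var.get() or str(uuid.uuid4())
    return {**(extra or {}), "X-Correlation-ID": correlation_id}


# Inside a request handler the middleware would already have set the variable
correlation_id_var.set("req-def456")
print(outbound_headers({"Accept": "application/json"}))
# prints {'Accept': 'application/json', 'X-Correlation-ID': 'req-def456'}
```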
Prometheus Metrics¶
from fastapi import FastAPI, Request, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest,
)
import time

# `app` is the FastAPI instance from the correlation-ID example
# Define metrics at module level
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_code"]
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1.0, 2.5, 5.0]
)
ACTIVE_ORDERS = Gauge(
    "orders_active_total",
    "Number of orders in processing state"
)
ORDER_PROCESSING_ERRORS = Counter(
    "order_processing_errors_total",
    "Order processing errors by type",
    ["error_type"]
)

# FastAPI middleware to instrument automatically
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, unaffected by system time changes
    response = await call_next(request)
    duration = time.perf_counter() - start
    # Note: raw paths such as /orders/123 create one time series per ID;
    # prefer the route template as the endpoint label where you can
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status_code=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)
    return response

# Expose metrics endpoint
@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
Grafana Dashboard Standards¶
Every service must have a dashboard covering the RED method:
RED METHOD (for request-driven services):
✦ Rate — requests per second
✦ Errors — error rate (4xx, 5xx)
✦ Duration — latency (p50, p95, p99)
USE method (for resource-driven services):
✦ Utilisation — % time the resource is busy
✦ Saturation — how much work is queued
✦ Errors — error rate
# Grafana dashboard panels (minimum per service):
panels:
  - title: "Request Rate (req/s)"
    query: sum(rate(http_requests_total[5m]))
  - title: "Error Rate (%)"
    query: |
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) * 100
  - title: "Latency p99 (ms)"
    # aggregate the buckets by le first, otherwise one quantile is drawn per label combination
    query: |
      histogram_quantile(0.99,
        sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000
  - title: "Active Orders"
    query: orders_active_total
  - title: "Error Types"
    query: sum by (error_type) (rate(order_processing_errors_total[5m]))
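To see what the RED panels actually compute, the same three signals can be derived by hand from raw request samples. The sketch below is illustrative: the `Sample` record is hypothetical, errors are counted as 5xx responses to match the error-rate panel, and percentiles use the simple nearest-rank method rather than the bucket interpolation Prometheus performs.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    status_code: int
    duration_s: float


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(samples: list[Sample], window_s: float) -> dict:
    """Rate, Errors and Duration for the samples seen in one time window."""
    durations = [s.duration_s for s in samples]
    errors = sum(1 for s in samples if s.status_code >= 500)
    return {
        "rate_per_s": len(samples) / window_s,
        "error_pct": 100.0 * errors / len(samples),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }
```

Percentiles matter because averages hide tail latency: a service can have a healthy mean while one request in a hundred takes seconds.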
SLOs — Service Level Objectives¶
Observability without SLOs is noise without signal. SLOs define what "good" looks like for your service:
# SLO definitions per service
service: order-service
slos:
  - name: availability
    description: "Orders API is available"
    target: 99.9% # 8.76 hours downtime/year
    # sum() both sides: without it the division only pairs series with identical labels
    measurement: |
      1 - (sum(rate(http_requests_total{status_code=~"5.."}[30d]))
      / sum(rate(http_requests_total[30d])))
  - name: latency
    description: "Order placement completes in < 2 seconds"
    target: 95% # 95% of requests under 2s
    measurement: |
      histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[30d]))) < 2
  - name: correctness
    description: "Orders are processed without errors"
    target: 99.5%
    measurement: |
      1 - (sum(rate(order_processing_errors_total[30d]))
      / sum(rate(orders_placed_total[30d])))
Error Budget: Error Budget = 1 - SLO target
If your SLO is 99.9%, your error budget is 0.1% (43.8 minutes/month). When the budget is consumed, feature work stops and reliability work takes priority.
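The arithmetic behind these figures is short enough to check directly. In this sketch the function names are hypothetical, and a month is taken as 365.25 / 12 days, which is how the 43.8-minute figure arises.

```python
def error_budget_minutes(slo_target: float, period_days: float = 365.25 / 12) -> float:
    """Minutes of full downtime allowed in the period for a target such as 0.999."""
    return (1 - slo_target) * period_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


print(round(error_budget_minutes(0.999), 1))               # prints 43.8 (minutes per month)
print(round(error_budget_minutes(0.999, 365) / 60, 2))     # prints 8.76 (hours per year)
print(round(budget_remaining(0.999, 1_000_000, 250), 2))   # prints 0.75
```

In the last line, a 99.9% target over one million requests allows about 1,000 failures, so 250 failures leave three quarters of the budget.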
Standard 9: Kubernetes Deployment Standards¶
All services deploy to Kubernetes. These are the non-negotiable standards for every Deployment manifest.
# k8s/deployment.yaml: complete standard template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
  labels:
    app: order-service
    version: "1.2.0"
    team: platform
spec:
  replicas: 3 # minimum 3 for HA in production
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # allow 1 extra pod during update
      maxUnavailable: 0 # never reduce below desired count
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
        version: "1.2.0"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      # Security context: run as non-root
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      # Terminate gracefully
      terminationGracePeriodSeconds: 30
      containers:
        - name: order-service
          image: ghcr.io/mycompany/order-service:sha-abc1234 # exact SHA, not 'latest'
          ports:
            - containerPort: 8000
          # Resource limits: ALWAYS set these
          resources:
            requests: # guaranteed allocation
              memory: "256Mi"
              cpu: "100m"
            limits: # maximum allocation
              memory: "512Mi"
              cpu: "500m"
          # Environment from ConfigMap and Secrets
          envFrom:
            - configMapRef:
                name: order-service-config
            - secretRef:
                name: order-service-secrets
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          # Health probes: all three, always
          startupProbe:
            httpGet:
              path: /health/live
              port: 8000
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 0
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            periodSeconds: 5
            failureThreshold: 2
          # Graceful shutdown
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
          # Read-only filesystem where possible
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp # writable tmp if needed
      volumes:
        - name: tmp
          emptyDir: {}
      # Pod anti-affinity: spread across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - order-service
                topologyKey: kubernetes.io/hostname
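The non-negotiable parts of the template above lend themselves to an automated check in CI. The following is a minimal sketch, not an established tool: `check_deployment` is a hypothetical helper that takes the manifest as a plain dict (for example, the result of parsing the YAML) and covers only a handful of the rules.

```python
def check_deployment(manifest: dict) -> list[str]:
    """Return a list of violations of the Deployment standards (empty = pass)."""
    problems = []
    spec = manifest.get("spec", {})
    pod = spec.get("template", {}).get("spec", {})

    if spec.get("replicas", 1) < 3:
        problems.append("replicas < 3")
    if not pod.get("securityContext", {}).get("runAsNonRoot"):
        problems.append("pod does not set runAsNonRoot")

    for container in pod.get("containers", []):
        name = container.get("name", "?")
        image = container.get("image", "")
        # Only look for a tag in the last path segment, so a registry port
        # such as localhost:5000/app is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        tag = image.rsplit(":", 1)[1] if ":" in last_segment else "latest"
        if tag == "latest":
            problems.append(f"{name}: image is untagged or uses 'latest'")

        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(kind, {}):
                    problems.append(f"{name}: missing resources.{kind}.{resource}")

        for probe in ("startupProbe", "livenessProbe", "readinessProbe"):
            if probe not in container:
                problems.append(f"{name}: missing {probe}")
    return problems
```

A pipeline step could fail the build whenever the returned list is non-empty, printing each entry as a review comment.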
# k8s/hpa.yaml: autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
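To see what these targets mean in practice, the sketch below applies the scaling rule from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), to each metric and takes the largest proposal. It is a simplified model: the function name is hypothetical, utilization is expressed as a percentage of the container's resource requests, and stabilization windows, unready pods, and missing metrics are ignored.

```python
import math


def desired_replicas(current: int, utilization: dict, targets: dict,
                     min_replicas: int = 3, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation across several resource metrics."""
    proposals = []
    for metric, target in targets.items():
        ratio = utilization[metric] / target
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to the target: this metric proposes no change.
            proposals.append(current)
        else:
            proposals.append(math.ceil(current * ratio))
    # The largest proposal wins, clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    targets = {"cpu": 70, "memory": 80}
    # CPU at twice its target doubles the replica count: 3 -> 6.
    print(desired_replicas(3, {"cpu": 140, "memory": 40}, targets))  # 6
```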
Standard 10: GitOps — Infrastructure as Code¶
Every infrastructure change must go through git. No manual kubectl apply. No console clicks in production.
GITOPS WORKFLOW (with Argo CD):
Developer               Git Repo               Argo CD         Kubernetes
    │                       │                     │                 │
    ├─ git push ───────────>│                     │                 │
    │                       ├─ PR / Review        │                 │
    │                       ├─ Merge to main ────>│                 │
    │                       │                     ├─ Detect diff    │
    │                       │                     ├─ kubectl apply >│
    │                       │                     │                 │
    │<──────────────────────── Sync status ─────────────────────────┤
# argocd/order-service-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: order-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/mycompany/k8s-manifests
    targetRevision: main
    path: services/order-service/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true # remove resources deleted from git
      selfHeal: true # revert manual changes in cluster
    syncOptions:
      - CreateNamespace=true
GitOps rules:
- The git repository is the single source of truth for cluster state
- Manual changes to the cluster are automatically reverted
- All changes are auditable via git history
- Rollback = git revert: simple, fast, safe
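The reconcile step behind these rules can be pictured as a diff between two maps. This is a toy model under stated assumptions, not Argo CD's implementation: `plan_sync` is a hypothetical function, and both states are reduced to dicts keyed by resource name.

```python
def plan_sync(desired: dict, live: dict, prune: bool = True) -> dict:
    """Compare git (desired) with the cluster (live) and list what to change."""
    # Anything missing from the cluster or different from git gets (re)applied.
    to_apply = sorted(k for k in desired if live.get(k) != desired[k])
    # Anything in the cluster that git no longer declares is removed when pruning.
    to_delete = sorted(k for k in live if k not in desired) if prune else []
    return {"apply": to_apply, "delete": to_delete}


if __name__ == "__main__":
    git_state = {
        "deployment/order-service": {"replicas": 3},
        "service/order-service": {"port": 80},
    }
    cluster_state = {
        "deployment/order-service": {"replicas": 5},  # manual change
        "configmap/old-flags": {},                    # deleted from git
    }
    print(plan_sync(git_state, cluster_state))
```

With self-healing enabled, the same comparison also runs when only the cluster side changes, which is why the manually scaled Deployment in the example ends up in the apply list.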
Standard 11: Incident Management¶
Incidents happen. The standard is not "never have incidents" but "respond effectively, recover quickly, and learn systematically."
Severity Levels¶
┌──────────┬────────────────────────────────────┬─────────────────────────┐
│ Severity │ Definition                         │ Response Time           │
├──────────┼────────────────────────────────────┼─────────────────────────┤
│ SEV-1    │ Production down, all users         │ Immediate               │
│ (P0)     │ impacted, revenue loss             │ page on-call, all hands │
├──────────┼────────────────────────────────────┼─────────────────────────┤
│ SEV-2    │ Major feature broken, significant  │ < 15 minutes            │
│ (P1)     │ user impact, workaround available  │ page on-call            │
├──────────┼────────────────────────────────────┼─────────────────────────┤
│ SEV-3    │ Minor feature degraded, some       │ < 2 hours               │
│ (P2)     │ users affected                     │ notify team             │
├──────────┼────────────────────────────────────┼─────────────────────────┤
│ SEV-4    │ Minor issue, no user impact        │ Next business day       │
│ (P3)     │                                    │ create ticket           │
└──────────┴────────────────────────────────────┴─────────────────────────┘
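If the severity table is encoded as data, paging tools and dashboards can share one definition. The sketch below is one possible encoding; the class and function names are hypothetical, and SEV-4's "next business day" is modelled as having no fixed clock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    priority: str
    respond_within: Optional[timedelta]  # None = next business day
    page_on_call: bool


POLICIES = {
    "SEV-1": SeverityPolicy("P0", timedelta(0), page_on_call=True),
    "SEV-2": SeverityPolicy("P1", timedelta(minutes=15), page_on_call=True),
    "SEV-3": SeverityPolicy("P2", timedelta(hours=2), page_on_call=False),
    "SEV-4": SeverityPolicy("P3", None, page_on_call=False),
}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Latest acceptable response time, or None when no fixed clock applies."""
    policy = POLICIES[severity]
    if policy.respond_within is None:
        return None
    return detected_at + policy.respond_within
```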
Runbooks — Operational Playbooks¶
Every service must have a runbook for its common failure modes:
# Runbook: Order Service — High Error Rate
## When to use this runbook
Alert: order error rate above 5% for 5 minutes, i.e. `rate(order_processing_errors_total[5m]) / rate(orders_placed_total[5m]) > 0.05`
## Diagnosis steps
1. Check Grafana dashboard: https://grafana.internal/d/order-service
- Is the error rate uniform across all pods? → infrastructure issue
- Is it one pod? → restart that pod first
2. Check error logs:
```
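# --tail=-1: with a label selector, kubectl returns only the last 10 lines per pod by default
# jq -c: one JSON object per line, so `tail -50` keeps whole log entries
kubectl logs -l app=order-service --namespace=production --since=10m --tail=-1 \
| jq -c 'select(.level == "error")' | tail -50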
```
3. Check dependent services:
- Payments service: https://grafana.internal/d/payments
- Inventory service: https://grafana.internal/d/inventory
- Database: https://grafana.internal/d/postgres
## Common causes and fixes
### Database connection pool exhausted
Symptom: `connection pool timeout` in logs
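Fix: `kubectl scale deployment order-service --namespace=production --replicas=5`
(more pods = more connections; increase pool_size if recurring)
Note: this is a stopgap. The HPA and Argo CD selfHeal will reconcile the replica count, so make any lasting change in git.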
### Stripe API rate limited
Symptom: `stripe.error.RateLimitError` in logs
Fix: Payment retries are automatic. Monitor for 5 minutes.
If persistent: check Stripe status page.
### Memory leak causing OOM
Symptom: pods restarting, `OOMKilled` in events
Fix: `kubectl rollout restart deployment/order-service --namespace=production`
Create P1 ticket for root cause investigation.
## Escalation
- Tier 1 (15 min): On-call engineer
- Tier 2 (30 min): Service owner (@order-team in Slack)
- Tier 3 (45 min): Engineering Director
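The "for 5 minutes" part of the alert that triggers this runbook matters: a single bad sample should not page anyone. The sketch below illustrates that semantics under the assumption of one error-ratio sample per minute; `should_fire` is a hypothetical name, and a real alerting rule would live in the monitoring system rather than in application code.

```python
def should_fire(error_ratios: list[float], threshold: float = 0.05,
                for_samples: int = 5) -> bool:
    """True only if the most recent `for_samples` samples all exceed the threshold."""
    if len(error_ratios) < for_samples:
        return False  # not enough history to satisfy the "for" duration
    return all(ratio > threshold for ratio in error_ratios[-for_samples:])


if __name__ == "__main__":
    # A brief dip below the threshold resets the clock, so this does not fire.
    print(should_fire([0.06, 0.07, 0.02, 0.08, 0.09]))        # False
    # Five consecutive samples above 5% do fire.
    print(should_fire([0.01, 0.06, 0.07, 0.08, 0.09, 0.10]))  # True
```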
Blameless Post-Mortems¶
After every SEV-1 and SEV-2 incident, a blameless post-mortem must be completed within 5 business days:
# Post-Mortem: Order Service Outage — 2026-05-15
**Duration:** 47 minutes (10:14 – 11:01 UTC)
**Severity:** SEV-1
**Impact:** 100% of order placements failed; ~2,400 orders affected
**Author:** [Engineer Name]
**Reviewed by:** [Team Lead]
## Timeline
- 10:14 — Automated alert: order error rate > 20%
- 10:17 — On-call engineer paged, begins investigation
- 10:25 — Root cause identified: DB connection pool exhausted
- 10:31 — Mitigation applied: connection pool size increased
- 10:45 — Error rate returned to baseline
- 11:01 — Incident declared resolved
## Root Cause
A database migration deployed at 09:50 did not create an index on the
`orders.customer_id` column. Without the index, every order lookup
triggered a full table scan. At peak traffic (10:10–10:14), this
exhausted the 20-connection pool. New requests could not acquire
connections and returned 500 errors.
## What went well
✓ Alert fired within 30 seconds of impact
✓ On-call engineer engaged quickly (within 3 minutes)
✓ Root cause identified in < 10 minutes via structured logs
✓ Team communicated clearly in the #incidents channel
## What went wrong
✗ Migration did not include index creation
✗ No performance test gate validates query performance before deploy
✗ Connection pool size was never revisited after DB was scaled up
## Action items
| Action | Owner | Due |
|---------------------------------------------|-----------|------------|
| Add index to orders.customer_id | @dev-team | 2026-05-16 |
| Add query performance check to CI pipeline | @platform | 2026-05-18 |
| Review and update connection pool sizes | @infra | 2026-05-18 |
| Create runbook for DB connection exhaustion | @on-call | 2026-05-19 |
## Lessons learned
- Database migrations must be reviewed for index coverage
- Performance regression tests should run against staging pre-deploy
- Connection pool size should be proportional to pod count
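One way to make the connection-pool lesson concrete is to check the worst case, where every pod opens its full pool at once. The sketch below is illustrative only: the function names, the reserve of 10 connections, and the database limit of 100 are assumptions, while the 20-connection pool comes from the incident and 20 pods matches the HPA maximum in Standard 9.

```python
def total_connections(pods: int, pool_size: int) -> int:
    """Worst-case connections opened against the database."""
    return pods * pool_size


def max_pool_size(db_max_connections: int, max_pods: int, reserved: int = 10) -> int:
    """Largest per-pod pool that keeps the worst case under the database limit."""
    usable = db_max_connections - reserved  # keep headroom for admin and migrations
    if usable <= 0 or max_pods <= 0:
        raise ValueError("no connections left to distribute")
    return usable // max_pods


if __name__ == "__main__":
    print(total_connections(pods=20, pool_size=20))            # 400
    print(max_pool_size(db_max_connections=100, max_pods=20))  # 4
```

Under these assumed numbers, a fully scaled-out service would ask for four times what the database accepts, so the pool size and the autoscaler ceiling have to be chosen together.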
Blameless means blameless
A post-mortem that names individuals as root causes ("engineer X forgot to add the index") is not blameless and is not useful. Systems cause incidents, not people. The question is always: what system change would have prevented this?
Standard 12: Code Review Standards¶
Code review is not about finding bugs (tests do that) — it's about knowledge sharing, consistency, and design feedback.
The Author's Responsibilities¶
BEFORE requesting review:
✓ Self-review your own diff first
✓ PR description explains WHY (not just what — git diff shows that)
✓ All CI checks pass — never request review on a failing PR
✓ PR is small enough to review in < 30 minutes
✓ Tests are included and pass
✓ Linked to the Jira/Linear ticket
PR description template:
## What
[1-2 sentences on what this change does]
## Why
[The problem being solved — link to ticket]
## How to test
[Steps to verify the change works]
## Screenshots / Evidence
[Before/after for UI; log output for backend]
The Reviewer's Responsibilities¶
REVIEW within:
✓ 4 hours for unblocking changes (SEV-2+)
✓ 24 hours for standard changes
REVIEW for:
✓ Correctness: does it do what it says?
✓ Tests: are the right scenarios covered?
✓ Design: is this the right abstraction?
✓ Security: any obvious vulnerabilities?
✓ Observability: are errors logged and metrics emitted?
NOT responsible for:
✗ Style (automated by linting)
✗ Formatting (automated by formatters)
✗ Finding all bugs (tests do that)
PR Size Standards¶
IDEAL PR SIZE: < 400 lines changed
> 400 lines: Harder to review, misses more issues
> 800 lines: Review quality degrades significantly
> 1000 lines: "LGTM" reviews — nobody is reading this
If your PR is large, split it:
✦ Refactor in one PR (no behaviour change)
✦ Feature in a second PR (behaviour added)
A PR that says "refactor + feature" in the title
should be two PRs.
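The size bands above are easy to check automatically. The following sketch sums added and deleted lines from `git diff --numstat` output and maps the total onto the bands; the function names and the treatment of exactly 400 lines (counted in the "harder to review" band) are assumptions.

```python
def lines_changed(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.strip().splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added == "-":  # binary files are reported as "-<TAB>-<TAB>path"
            continue
        total += int(added) + int(deleted)
    return total


def classify_pr(changed: int) -> str:
    """Map a line count onto the PR size bands."""
    if changed < 400:
        return "ideal"
    if changed <= 800:
        return "harder to review"
    if changed <= 1000:
        return "review quality degrades"
    return "too large: split it"


if __name__ == "__main__":
    sample = "10\t2\tapp/main.py\n-\t-\tdocs/logo.png\n300\t150\tapp/orders.py\n"
    print(lines_changed(sample))               # 462
    print(classify_pr(lines_changed(sample)))  # harder to review
```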
Standard 13: Collaboration and Knowledge Management¶
Communication Standards¶
SYNCHRONOUS (real-time):
✦ Slack / Teams — for time-sensitive, conversational
✦ Video calls — for decisions, complex discussions
✦ Do NOT use email for engineering team communication
ASYNCHRONOUS (not real-time):
✦ Jira / Linear — for task tracking and sprint management
✦ Confluence / Notion — for documentation and decisions
✦ GitHub/GitLab — for code reviews and technical discussion
✦ Post-mortems — for incident learnings
CHANNEL NAMING CONVENTION (Slack):
#team-[team-name] — team's primary channel
#service-[service-name] — alerts and deployments for a service
#incident-[date]-[name] — per-incident channels (auto-archived)
#deploy-[env] — deployment notifications
#alerts-[severity] — monitoring alerts by severity
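A naming convention tends to hold only if something checks it. As a hedged sketch, the patterns below encode the five channel types; the lowercase-and-hyphen slug rule, the YYYY-MM-DD date format, the environment names, and the `sev1` to `sev4` spelling are all assumptions that the convention above does not specify.

```python
import re
from typing import Optional

SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"

CHANNEL_PATTERNS = {
    "team": re.compile(rf"^#team-{SLUG}$"),
    "service": re.compile(rf"^#service-{SLUG}$"),
    "incident": re.compile(rf"^#incident-\d{{4}}-\d{{2}}-\d{{2}}-{SLUG}$"),
    "deploy": re.compile(r"^#deploy-(?:dev|staging|production)$"),
    "alerts": re.compile(r"^#alerts-sev[1-4]$"),
}


def channel_kind(name: str) -> Optional[str]:
    """Return which convention a channel name follows, or None if it follows none."""
    for kind, pattern in CHANNEL_PATTERNS.items():
        if pattern.match(name):
            return kind
    return None


if __name__ == "__main__":
    print(channel_kind("#incident-2026-05-15-order-outage"))  # incident
    print(channel_kind("#random"))                            # None
```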
Architecture Decision Records (ADRs)¶
Every significant technical decision must be documented as an ADR — a short, dated record that captures the decision, the context, and the rationale:
# ADR-0023: Use PostgreSQL for the Orders Service Database
Date: 2026-05-18
Status: Accepted
Deciders: [Team leads]
## Context
The Orders service needs a persistent store for order data.
Requirements: ACID transactions, complex queries, JSON support,
managed service availability on AWS.
## Decision
Use PostgreSQL (via AWS RDS) as the primary database.
## Alternatives considered
- MySQL: Less powerful JSON support, no native arrays
- MongoDB: Document model is a poor fit for relational order data; multi-document transactions (available since 4.0) add overhead; operational complexity
- DynamoDB: Poor fit for complex relational queries; high cost at scale
## Consequences
+ Full ACID compliance for order state transitions
+ Rich query capabilities for reporting
+ Mature ecosystem, team familiarity
- Vertical scaling limits (mitigated by read replicas)
- Schema migrations require careful planning
Store ADRs in the repository at docs/adr/ — they are versioned alongside the code they describe.
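Numbering ADRs by hand invites collisions when two branches add one at the same time, so a small helper can pick the next number and file name. This is a sketch under assumptions: the `NNNN-slug.md` naming scheme and the function name are hypothetical, chosen to line up with the ADR-0023 example above.

```python
import re


def next_adr_filename(existing: list[str], title: str) -> str:
    """Return the next `NNNN-slug.md` name given the files already in docs/adr/."""
    numbers = [
        int(match.group(1))
        for name in existing
        if (match := re.match(r"(\d{4})-", name))
    ]
    next_number = max(numbers, default=0) + 1
    # Lowercase the title and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{next_number:04d}-{slug}.md"


if __name__ == "__main__":
    files = ["0001-use-trunk-based-development.md", "0022-adopt-argo-cd.md", "README.md"]
    print(next_adr_filename(files, "Use PostgreSQL for the Orders Service Database"))
    # 0023-use-postgresql-for-the-orders-service-database.md
```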
The Complete Engineering Standards Checklist¶
Use this as a new service launch checklist and quarterly audit:
- Trunk-based development — branches merged within 2 days
- Pre-commit hooks installed (lint, format, secret scan)
- Conventional Commits enforced via commitlint
- Branch protection on main: required reviews + passing CI
- .gitignore includes .env, credentials, build artifacts
- Every commit triggers CI pipeline
- Pipeline stages: lint → test → security → build → deploy
- No manual steps in staging deployment
- Production deployment requires approval gate
- Rollback procedure documented and tested
- Build artifacts are immutable and tagged with git SHA
- Unit test coverage ≥ 80%
- Integration tests run in CI with real backing services
- Contract tests for all service-to-service dependencies
- Performance tests run on staging before production
- Tests run in < 10 minutes (fast feedback)
- SAST (Semgrep/SonarQube) runs on every PR
- Dependency scanning (pip-audit/Dependabot) enabled
- Secret scanning enabled (detect-secrets + GitHub)
- Container image scanned (Trivy) before push
- DAST (OWASP ZAP) runs against staging weekly
- No HIGH/CRITICAL CVEs in production
- Structured JSON logging to stdout
- Correlation ID propagated across all services
- Prometheus metrics endpoint at /metrics
- Grafana dashboard: RED metrics + business metrics
- SLOs defined and error budget tracked
- Alerts configured for SLO breaches
- Distributed traces with OpenTelemetry
- Resource requests AND limits set on all containers
- Liveness, readiness, and startup probes configured
- Minimum 3 replicas in production
- Pod anti-affinity rule to spread across nodes
- HPA configured for auto-scaling
- Non-root user in container
- Image pinned to exact SHA (not latest)
- README with: purpose, setup, architecture diagram
- Runbook for all alert conditions
- ADR for all significant design decisions
- On-call rotation documented
- Post-mortem for every SEV-1 and SEV-2
Summary — The Thirteen Standards¶
| # | Standard | Core Tool | What It Prevents |
|---|---|---|---|
| 1 | 12-Factor App | Config via env vars, stdout logs | Environment-specific builds, log sprawl |
| 2 | Trunk-Based Dev | Feature toggles, short branches | Merge conflicts, integration hell |
| 3 | Pre-Commit Hooks | pre-commit, detect-secrets | Bad code and secrets reaching CI |
| 4 | CI/CD Pipeline | GitHub Actions, Argo CD | Manual, error-prone deployments |
| 5 | Testing Pyramid | pytest, Pact | Bugs in production, slow feedback |
| 6 | DevSecOps | Semgrep, Trivy, OWASP ZAP | Security vulnerabilities reaching users |
| 7 | Semantic Versioning | semantic-release | Breaking changes without warning |
| 8 | Observability | Prometheus, Loki, Jaeger | Blind operations, slow incident response |
| 9 | Kubernetes Standards | Resource limits, probes, HPA | Outages, poor scaling, security gaps |
| 10 | GitOps | Argo CD | Manual drift, unaudited changes |
| 11 | Incident Management | Runbooks, post-mortems | Repeated incidents, slow recovery |
| 12 | Code Review | PR templates, size limits | Knowledge silos, poor design |
| 13 | Collaboration | ADRs, Slack standards | Decision loss, tribal knowledge |
Standards compound. A team that applies all thirteen doesn't just add the benefits — it multiplies them. Reliable CI makes security scanning trustworthy. GitOps makes observability meaningful. Blameless post-mortems make runbooks better. The whole is greater than the sum of its parts.
Essential reading for going deeper: The Phoenix Project and The DevOps Handbook by Gene Kim et al., Accelerate by Nicole Forsgren et al., Site Reliability Engineering by Google, and Team Topologies by Skelton & Pais.