What Is DevOps? The Complete Guide¶
Ask ten engineers what DevOps is and you'll get ten different answers.
"It's CI/CD." "It's automation." "It's when developers do operations." "It's a culture." "It's a job title." "It's a toolchain." "It's a movement."
They're all partially right. And because DevOps touches culture, process, and technology simultaneously, it resists simple definition.
Part 1: The Problem DevOps Was Born to Solve¶
To understand DevOps, you have to understand the world before it.
The Wall of Confusion¶
In the early 2000s, a typical software company was organised like this:
┌────────────────────────────────────────────────────────────┐
│ DEVELOPMENT TEAM │
│ │
│ "Our job is to build features. Ship fast, move fast." │
│ Incentivised on: features shipped, velocity │
└─────────────────────────────┬──────────────────────────────┘
│
"THROW IT OVER THE WALL"
│
▼
┌────────────────────────────────────────────────────────────┐
│ OPERATIONS TEAM │
│ │
│ "Our job is to keep the system stable. Change is risk." │
│ Incentivised on: uptime, stability, zero incidents │
└────────────────────────────────────────────────────────────┘
Development wanted change. Operations wanted stability. These goals were structurally opposed.
When a deployment failed at 2am, Operations blamed Development for shipping bad code. Development blamed Operations for having a fragile environment. Nobody owned the full pipeline. Nobody was accountable for the outcome.
The result: deployments happened every few months, every deployment was terrifying, and the gap between code written and value delivered could be measured in quarters.
The Agile Paradox¶
The Agile movement (2001) solved the development side of this problem. Teams began delivering working software every two weeks instead of every six months. But there was a catch: Agile made the wall of confusion worse, not better.
Now developers were shipping code every two weeks — to the wall. Operations teams, still running on quarterly change management processes, couldn't absorb that cadence. Code stacked up. Features were ready but not deployed. The value was trapped.
Agile without DevOps is a fast car with no road.
The Origin Story — 10+ Deploys Per Day¶
The turning point came in June 2009 at the O'Reilly Velocity Conference. John Allspaw and Paul Hammond from Flickr gave a talk titled "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr."
The thesis: Development and Operations don't have to be adversaries. When they share goals, tools, and accountability, they can deploy to production more than ten times a day — safely.
The talk went viral in the engineering community. It validated what many practitioners suspected: the wall of confusion was not inevitable. It was a choice.
That same year, Patrick Debois — a Belgian IT consultant who had been frustrated by the Dev/Ops divide for years — was inspired by the talk to organise the first DevOpsDays conference in Ghent, Belgium. He needed a Twitter hashtag. #DevOpsDays was too long. He shortened it to #DevOps.
A movement had a name.
The Phoenix Project (2013)¶
Gene Kim, Kevin Behr, and George Spafford's novel The Phoenix Project brought DevOps to the mainstream business audience. It's a fictional story of an IT manager at a failing company, but the problems it depicts — siloed teams, endless firefighting, failed deployments, business pressure — were immediately recognisable to anyone in technology.
The book introduced the concept of the Three Ways — the foundational principles underlying all DevOps practices. We'll cover those in depth shortly.
Part 2: What DevOps IS — and What It Isn't¶
The Misconceptions¶
| What people think DevOps means | What it actually means |
|---|---|
| A team called "the DevOps team" | A cultural approach that all teams adopt |
| A set of automation tools | Tools in service of culture and process |
| Developers doing operations work | Shared ownership of the full delivery cycle |
| Just CI/CD | CI/CD is one practice; DevOps is the philosophy |
| Something you "implement" in 3 months | A continuous improvement journey |
| Eliminating the Ops role | Evolving what Ops means (towards SRE/Platform) |
The most dangerous misconception
Creating a "DevOps team" is the most common way organisations cargo-cult DevOps without achieving it. A DevOps team becomes just another silo — the team responsible for the pipeline — while development and operations remain separate and disconnected. You cannot outsource a culture to a team.
What DevOps Actually Is¶
DevOps is the union of people, process, and technology to continuously deliver value to customers.
More precisely, it is a set of practices, cultural norms, and enabling technologies that allow an organisation to:
- Deliver software faster — from code to production in hours, not months
- Deliver software more reliably — fewer failures, faster recovery
- Iterate based on feedback — learn from real usage, not assumptions
- Reduce toil — automate repetitive work so humans do human things
THE DEVOPS INFINITY LOOP
PLAN ──────── CODE
╱ ╲
MONITOR BUILD
│ │
OPERATE TEST
╲ ╱
DEPLOY ──── RELEASE
← Continuous everything →
Planning, Integration, Testing,
Delivery, Monitoring, Feedback
Every phase feeds the next. Monitoring informs planning. Testing gates releases. Deployment triggers monitoring. The loop never stops — and that's the point.
Part 3: The Three Ways — DevOps Foundational Principles¶
Gene Kim's Three Ways are the conceptual foundation from which all DevOps practices derive. Understand these, and every DevOps practice makes intuitive sense.
The First Way: Flow¶
"Maximise the flow of work from Development through Operations to the customer."
The First Way is about left-to-right flow — how fast can an idea travel from conception to the hands of a customer?
Every obstacle to that flow is waste:
THE DEVOPS VALUE STREAM
Idea → Code → Review → Build → Test → Stage → Deploy → Customer
FLOW BLOCKERS (measure and eliminate these):
✗ Manual approval gates that sit for days
✗ Handoffs between teams (with queues)
✗ Environments that don't match production
✗ Large batch deployments (big bang releases)
✗ Long-running branches that cause merge conflicts
✗ Tests that take 2 hours to run
First Way practices:
- Small batch sizes: commit frequently, deploy frequently
- Limit work in progress: stop starting, start finishing
- Make every deployment small, routine, and low-risk
- Continuous Integration: integrate at least daily
- Continuous Delivery: always have a deployable artifact
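One way to make "measure and eliminate" concrete is flow efficiency: the share of total lead time spent on active work rather than waiting in a queue. The sketch below is illustrative only. The `Stage` type, the stage names, and every duration are hypothetical, not data from a real pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step in a value stream; durations are in hours."""
    name: str
    wait_hours: float  # time the change sits in a queue before the step starts
    work_hours: float  # time spent actively working on the change


def flow_efficiency(stages: list[Stage]) -> float:
    """Return active work time as a fraction of total lead time."""
    work = sum(s.work_hours for s in stages)
    total = work + sum(s.wait_hours for s in stages)
    return work / total if total else 0.0


def biggest_queue(stages: list[Stage]) -> Stage:
    """Return the stage with the longest wait: the first candidate to fix."""
    return max(stages, key=lambda s: s.wait_hours)


# Hypothetical numbers, for illustration only.
pipeline = [
    Stage("review", wait_hours=48, work_hours=4),
    Stage("build", wait_hours=0, work_hours=1),
    Stage("manual approval", wait_hours=72, work_hours=1),
    Stage("deploy", wait_hours=2, work_hours=2),
]

print(f"flow efficiency: {flow_efficiency(pipeline):.1%}")
print(f"longest queue:   {biggest_queue(pipeline).name}")
```

With these made-up numbers, only 8 of 130 hours are active work, and the manual approval gate is the largest single queue. That is the kind of finding the flow-blocker list points at.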
The Second Way: Feedback¶
"Create fast feedback loops from Operations back to Development."
The Second Way is about right-to-left feedback — how fast does information about production reality reach the people who can act on it?
FEEDBACK LOOPS (shorter = better)
Production incident → Development team
Deployment failure → Developer who caused it
Performance degradation → Team responsible for the service
Security vulnerability → Developer who introduced it
User behaviour → Product team and developers
FAST FEEDBACK:
Alert fires → developer sees it → fixes it → ships fix (minutes to hours)
SLOW FEEDBACK:
Customer complains → support ticket → triage meeting → sprint planning →
development → testing → deployment (weeks to months)
Second Way practices:
- Monitoring and alerting on every service
- Telemetry: logs, metrics, traces in every application
- Peer review (code review is a feedback loop)
- Automated testing (the fastest feedback: tests catch bugs before humans do)
- Blameless post-mortems (feedback on process, not people)
The Third Way: Continual Learning and Experimentation¶
"Create a culture of high trust, risk-taking, and continual improvement."
The Third Way recognises that DevOps is not a destination — it's a practice of never stopping improvement. High-performing teams don't just run the system; they continuously improve the system.
THE IMPROVEMENT KATA (from Toyota)
Current State → Target State
│ │
└── Experiment ──┘
│
Learn
│
New Current State → New Target State → ...
Third Way practices:
- Retrospectives: regular, structured reflection on what to improve
- Blameless post-mortems: learning from incidents without blame
- Chaos engineering: deliberately introducing failures to build resilience
- Innovation time: dedicated time for teams to experiment
- Game days: rehearsed incident response
Part 4: CALMS — The Five Pillars of DevOps Culture¶
CALMS is the cultural framework for DevOps. It grew out of the CAMS model coined by Damon Edwards and John Willis, with Jez Humble later adding the L for Lean, and it provides one of the most complete answers to "what does DevOps culture actually look like?"
C — Culture: People and Process Before Tools¶
Culture is the hardest and most important pillar. No tool or automation can overcome a broken culture.
DevOps culture has specific, observable characteristics:
DEVOPS CULTURE INDICATORS
✓ Shared goals: Dev and Ops measured on the same outcomes
(deployment frequency, MTTR) — not siloed metrics
✓ Psychological safety: engineers can raise concerns and admit
mistakes without fear of punishment
✓ Blameless mindset: incidents are system failures, not person failures
✓ You build it, you run it: teams own their services in production
(no "throw it over the wall")
✓ Empathy across roles: developers understand operational constraints;
operators understand development pressures
The culture test: After a production incident, what happens? If people hide it, downplay it, or point fingers — you have a fear culture. If people write a transparent post-mortem, share what they learned, and make systemic improvements — you have a DevOps culture.
Culture transforms at Etsy
Etsy (the e-commerce marketplace) is one of the earliest DevOps success stories. In 2009, they deployed to production once a week, with a team of eight engineers managing a three-hour, high-stress process. By 2011, they were deploying 25+ times per day, with any engineer able to deploy independently.
The transformation started with culture: blameless post-mortems, shared responsibility, and psychological safety. The tools followed the culture — not the other way around.
A — Automation: Eliminate Toil¶
Toil is work that is manual, repetitive, automatable, tactical, and devoid of enduring value. Toil is the enemy of high-performing engineering teams.
THE TOIL TEST
Is this work:
✓ Manual (a human does it every time)?
✓ Repetitive (done over and over)?
✓ Automatable (a computer could do it)?
✓ Tactical (interrupt-driven and reactive, not strategic)?
✓ Without enduring value (no lasting outcome)?
If yes to all five → this is toil. Automate it.
What to automate in DevOps:
| Category | What to automate |
|---|---|
| Code quality | Linting, formatting, type checking (runs on save or commit) |
| Testing | All test suites (unit, integration, E2E) |
| Security | SAST, dependency scanning, secret detection |
| Build | Compilation, packaging, Docker image build |
| Deployment | All deployments, zero manual kubectl apply |
| Infrastructure | Provisioning via Terraform, Pulumi |
| Monitoring | Alert creation, dashboard provisioning |
| Incident response | Runbook execution for known failure modes |
Google's SRE book recommends: SRE teams should spend no more than 50% of their time on toil. The rest must go to engineering work that reduces future toil.
L — Lean: Small Batches, Reduce Waste¶
Lean in DevOps comes directly from Toyota's Production System — the same philosophy that influenced the Agile movement.
The core lean insight for software: Large batches are the enemy. Large code changes, large releases, large deployments — all increase risk and slow feedback.
LARGE BATCH vs SMALL BATCH
LARGE BATCH (traditional):
2 months of work → 1 giant release → 50 bugs → 2-week hotfix → repeat
SMALL BATCH (DevOps):
2 hours of work → deploy → 0-1 bugs → fix in minutes → deploy again
The math: 10 small releases have LESS total risk than 1 large release.
Risk grows super-linearly with batch size.
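One simple way to see why risk can grow faster than batch size is to count the pairs of changes that ship together and could interact. This is a toy model under a stated assumption (every pair of changes in the same release is a potential source of a defect), not a measured result. The function names are hypothetical.

```python
def interaction_pairs(changes_in_release: int) -> int:
    """Number of distinct pairs among changes shipped in the same release."""
    n = changes_in_release
    return n * (n - 1) // 2


def total_pairs(total_changes: int, releases: int) -> int:
    """Pairs to reason about when changes are split evenly across releases."""
    per_release, remainder = divmod(total_changes, releases)
    sizes = [per_release + 1] * remainder + [per_release] * (releases - remainder)
    return sum(interaction_pairs(size) for size in sizes)


one_big = total_pairs(total_changes=50, releases=1)     # 50 * 49 / 2 = 1225
ten_small = total_pairs(total_changes=50, releases=10)  # 10 releases * 10 pairs = 100
print(one_big, ten_small)
```

Under this model the same 50 changes carry 1225 potential interactions as one release but only 100 as ten releases. A failed small release also has 5 suspects to investigate rather than 50.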
Lean practices in DevOps:
- Work In Progress (WIP) limits: stop starting, start finishing
- Value Stream Mapping: visualise and eliminate waste in the pipeline
- Single-piece flow: each change moves through the pipeline independently
- Eliminating inventory: undeployed code is inventory with no value
M — Measurement: Data-Driven Everything¶
In DevOps, opinions are replaced by measurements. You can't improve what you don't measure.
What to measure:
The measurement anti-pattern: Vanity metrics. Lines of code written, story points completed, tickets closed — these numbers can all go up while the system gets worse. Measure outcomes, not outputs.
S — Sharing: Knowledge is a Team Asset¶
High-performing DevOps teams treat knowledge as infrastructure — it must be maintained, versioned, and accessible to everyone.
What sharing looks like:
- Blameless post-mortems published company-wide — so everyone learns from every incident
- Runbooks for every service — so any on-call engineer can respond, not just the original author
- Architecture Decision Records — so design decisions don't live in one person's head
- Communities of Practice — cross-team groups sharing expertise in specific areas
- InnerSource — treating internal repositories like open source (anyone can contribute, anyone can see)
- Game days — rehearsed incident scenarios so the whole team can respond, not just senior engineers
Part 5: The DevOps Lifecycle in Depth¶
The DevOps infinity loop has eight phases. Here's what each means in practice.
Plan¶
TOOLS: Jira, Linear, GitHub Issues, Confluence, Miro
PRACTICES:
✦ OKRs define the quarter's objectives and measurable outcomes
✦ Sprint planning breaks OKRs into deliverable stories
✦ Architecture Decision Records (ADRs) document key design choices
✦ Definition of Ready — stories must meet criteria before sprint entry
THE DEVOPS DIFFERENCE:
Operations joins planning. If a feature will require new infrastructure,
on-call rotation changes, or monitoring additions, Ops knows before coding starts.
Code¶
TOOLS: VS Code, JetBrains, GitHub Copilot, Git
PRACTICES:
✦ Trunk-Based Development — short-lived branches, daily merges
✦ Pair programming / mob programming for complex work
✦ Pre-commit hooks — lint, format, secret scan before commit
✦ Conventional Commits — structured commit messages for automation
THE DEVOPS DIFFERENCE:
Developers write infrastructure code (Terraform, Helm charts) alongside
application code. "Works on my machine" becomes far less common with shared
Docker Compose environments that mirror production as closely as practical.
Build¶
TOOLS: GitHub Actions, GitLab CI, Jenkins, CircleCI, Buildkite
WHAT HAPPENS ON EVERY COMMIT:
✦ Dependency installation (from lock file)
✦ Compilation / bundling
✦ Static analysis (linting, type checking)
✦ Unit tests (< 5 minutes total)
✦ SAST security scan
✦ Docker image build
THE DEVOPS DIFFERENCE:
Build artifacts are immutable. The same Docker image digest that passed
tests is the exact image that goes to production: not rebuilt, not modified.
"It works in staging" carries real weight because the artifact is identical;
only environment-specific configuration differs.
Test¶
THE TESTING PORTFOLIO
Unit Tests → ms, run every commit, developer feedback loop
Integration Tests → seconds, run every commit, service boundary validation
Contract Tests → minutes, validate API contracts between services
E2E Tests → minutes, validate critical user journeys
Performance Tests → minutes, catch regressions before production
Security Tests → minutes, DAST against staging environment
Chaos Tests → ongoing, verify resilience to failures
GOAL: < 10 minute total pipeline. If tests take longer, teams stop
running them. Fast feedback > comprehensive coverage.
Release¶
THE THREE DEPLOYMENT STRATEGIES
┌─────────────────────────────────────────────────────────────┐
│ BLUE-GREEN DEPLOYMENT │
│ │
│ Blue (v1) ←── 100% traffic Green (v2) ←── 0% traffic │
│ │
│ Deploy v2 to Green → Run smoke tests → Switch traffic → │
│ Green now gets 100%. Blue kept as rollback. │
│ │
│ Rollback: switch traffic back to Blue (< 1 minute) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ CANARY DEPLOYMENT │
│ │
│ v1: 100% → v2: 5% of traffic → monitor → v2: 25% → │
│ monitor → v2: 50% → monitor → v2: 100% │
│ │
│ Real users validate. Errors above threshold = auto-rollback │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ FEATURE FLAGS │
│ │
│ Code is deployed but feature is disabled. │
│ Enable for: internal users → beta users → % of users → all │
│ │
│ Decouple deployment (technical) from release (business) │
└─────────────────────────────────────────────────────────────┘
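The canary and feature-flag strategies above can be sketched in a few lines. This is a minimal illustration, not how any particular deployment tool or flag service works. `run_canary`, `flag_enabled`, the traffic steps, and the error-rate functions are all hypothetical; a real system would query its metrics backend instead of calling a stub.

```python
import hashlib
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[str, int]:
    """Step through traffic percentages, rolling back on the first bad reading.

    error_rate_at stands in for a metrics query: given the current traffic
    percentage, it returns the observed error rate of the new version.
    """
    if not steps:
        raise ValueError("at least one traffic step is required")
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return ("rolled_back", percent)
    return ("promoted", steps[-1])


def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a 0-99 bucket for a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < rollout_percent


# Hypothetical error-rate readings, for illustration only.
def healthy(percent: int) -> float:
    return 0.002


def degrading(percent: int) -> float:
    return 0.002 if percent < 50 else 0.08


print(run_canary([5, 25, 50, 100], healthy, max_error_rate=0.01))    # ('promoted', 100)
print(run_canary([5, 25, 50, 100], degrading, max_error_rate=0.01))  # ('rolled_back', 50)
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests, while different flags get independent buckets.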
Deploy¶
TOOLS: Kubernetes, Argo CD, Helm, Terraform
GITOPS PRINCIPLE:
✦ All deployments are triggered by git commits — no manual kubectl
✦ The git repository is the single source of truth for cluster state
✦ Argo CD watches the repo and reconciles the cluster to match
✦ Every deployment is auditable: who changed what, when, why
ZERO-DOWNTIME DEPLOYMENT REQUIREMENTS:
✦ Rolling updates with maxUnavailable=0
✦ SIGTERM handling — app finishes in-flight requests before shutdown
✦ Health checks — Kubernetes waits for readiness before routing traffic
✦ Graceful shutdown period — 30 seconds to drain connections
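The SIGTERM and drain requirements can be sketched with the Python standard library. This is a minimal illustration under assumptions: the `GracefulShutdown` class and its method names are hypothetical, and a real service would wire `start_request` and `finish_request` into its web framework's request lifecycle.

```python
import signal
import threading


class GracefulShutdown:
    """Tracks in-flight requests and refuses new work once SIGTERM arrives."""

    def __init__(self) -> None:
        self.stopping = threading.Event()
        self._in_flight = 0
        self._idle = threading.Condition()

    def install(self) -> None:
        """Register the handler; call this from the main thread at startup."""
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame) -> None:
        # Only flip a flag here: signal handlers should do as little as possible.
        self.stopping.set()

    def start_request(self) -> bool:
        """Return False once shutdown has begun, so new work is refused."""
        with self._idle:
            if self.stopping.is_set():
                return False
            self._in_flight += 1
            return True

    def finish_request(self) -> None:
        with self._idle:
            self._in_flight -= 1
            self._idle.notify_all()

    def drain(self, timeout_seconds: float = 30.0) -> bool:
        """Block until in-flight requests finish; False if the timeout expires."""
        with self._idle:
            return self._idle.wait_for(
                lambda: self._in_flight == 0, timeout=timeout_seconds
            )
```

A typical use would be to call `install()` at startup and then, once `stopping` is set, stop accepting connections, call `drain()`, and exit. The 30-second default mirrors the drain period listed above.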
Operate¶
THE SRE (SITE RELIABILITY ENGINEERING) APPROACH:
✦ SLOs (Service Level Objectives) define "good enough" uptime
✦ Error budgets quantify acceptable unreliability
✦ On-call rotations distribute the operational burden
✦ Runbooks document response procedures for known failure modes
✦ Capacity planning ensures resources ahead of demand
"YOU BUILD IT, YOU RUN IT":
Development teams are on-call for their own services. This creates
powerful feedback: if your service pages you at 3am, you fix the root
cause — you don't just restart the pod and go back to sleep.
Monitor¶
THE THREE PILLARS OF OBSERVABILITY:
LOGS → What happened? (events, errors, state changes)
METRICS → How is it performing? (rates, counts, latencies)
TRACES → Why is it slow? (distributed call paths)
ALERTING PHILOSOPHY:
✦ Alert on symptoms, not causes (alert on high error rate, not "CPU high")
✦ Every alert must be actionable — if you can't do anything, don't alert
✦ Alert fatigue kills on-call culture — fewer, better alerts
✦ SLO-based alerting: alert when error budget is burning too fast
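"Burning too fast" can be expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is a simplified illustration. The function names are hypothetical, and the default threshold of 14.4 is an assumption taken from a commonly used convention: it is the rate at which 2% of a 30-day budget is consumed in one hour (0.02 × 720 hours).

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is burning.

    A burn rate of 1.0 would use up exactly the whole budget over the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_rate / budget


def should_page(
    short_window_error_rate: float,
    long_window_error_rate: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only when both a short and a long window exceed the threshold.

    Requiring both windows filters out brief spikes (long window stays low)
    and stale incidents that have already recovered (short window stays low).
    """
    return (
        burn_rate(short_window_error_rate, slo_target) >= threshold
        and burn_rate(long_window_error_rate, slo_target) >= threshold
    )


# Hypothetical readings against a 99.9% SLO.
print(should_page(0.02, 0.016, slo_target=0.999))  # sustained problem
print(should_page(0.02, 0.005, slo_target=0.999))  # short spike only
```

This alerts on a symptom (users seeing errors faster than the SLO tolerates) rather than on a cause such as CPU usage.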
Part 6: CI/CD — The Engine of DevOps¶
Continuous Integration and Continuous Delivery/Deployment are the most concrete expressions of DevOps practices.
Continuous Integration (CI)¶
CI means every developer integrates their code with the main branch at least once per day, and every integration triggers an automated build and test run.
| Without CI | With CI |
|---|---|
| Developer A works for 2 weeks | Developer A integrates daily |
| Developer B works for 2 weeks | Developer B integrates daily |
| Merge day → 3 days of conflicts | Merge day → 30 minutes, if any |
| "Integration hell" → common and expensive | Conflicts caught within hours of introduction → cheap to fix |
The CI contract:
CI RULES (non-negotiable):
1. The build must not be broken. Ever.
If you break it, you fix it immediately — before doing anything else.
2. If the build fails, everyone stops.
A broken build is the team's highest priority. Nobody starts new work
while the build is red.
3. Commit at least daily.
Long-lived branches undermine CI. If you haven't integrated today,
you haven't integrated.
4. The build must be fast.
> 10 minutes = developers stop running it. Target < 5 minutes.
Continuous Delivery vs Continuous Deployment¶
These are often confused. They are different:
CONTINUOUS DELIVERY:
Every change is always in a deployable state.
Deployment to production is a BUSINESS decision — triggered by humans
when the business is ready.
CONTINUOUS DEPLOYMENT:
Every change that passes automated tests is AUTOMATICALLY deployed
to production. No human approval step.
CI CD (Delivery) CD (Deployment)
Code → Build → Test → Staging → [Human approves] → Production
OR
Code → Build → Test → Staging → [Automated] ──────→ Production
Which should you use?
Continuous Deployment is the gold standard — it forces small changes, fast feedback, and confidence in automation. But it requires:
- Very high automated test coverage
- Feature flags to control who sees new features
- Excellent monitoring and alerting
- Mature rollback procedures
Most teams start with Continuous Delivery (human approval to production) and evolve toward Continuous Deployment as confidence grows.
Part 7: DORA Metrics — Measuring DevOps Performance¶
The DORA (DevOps Research and Assessment) research programme identified four metrics that best predict software delivery performance and organisational outcomes.
THE FOUR DORA METRICS
┌─────────────────────────────────────────────────────────────────────┐
│ METRIC │ WHAT IT MEASURES │ TARGET (Elite) │
├─────────────────────┼───────────────────────────┼────────────────────┤
│ Deployment │ How often you deploy to │ Multiple times │
│ Frequency │ production │ per day │
├─────────────────────┼───────────────────────────┼────────────────────┤
│ Lead Time for │ Commit → running in │ Less than │
│ Changes │ production │ one hour │
├─────────────────────┼───────────────────────────┼────────────────────┤
│ Change Failure │ % of deploys that cause │ 0-5% │
│ Rate │ a production incident │ │
├─────────────────────┼───────────────────────────┼────────────────────┤
│ Mean Time to │ How fast you recover from │ Less than │
│ Restore (MTTR) │ a production incident │ one hour │
└─────────────────────┴───────────────────────────┴────────────────────┘
Why these four? Because they capture the tension between speed and stability that defines delivery performance:
- Deployment Frequency + Lead Time = speed (throughput)
- Change Failure Rate + MTTR = stability (quality and resilience)
Crucially, DORA research shows speed and stability are not trade-offs. Elite performers are both faster AND more stable than low performers. Practices that improve stability (automated testing, small batches, good monitoring) also improve speed — because less time is spent on rework, incidents, and firefighting.
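The four metrics can be computed from ordinary deployment records. The sketch below is one possible way to do it, under assumptions: the `Deployment` record, its field names, and the sample data are hypothetical, and real teams would pull these timestamps from their CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments: list[Deployment], period_days: int) -> dict:
    """Summarise throughput (first two keys) and stability (last two keys)."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


# Hypothetical week of deployments, for illustration only.
base = datetime(2024, 1, 1, 9, 0)
week = [
    Deployment(base, base + timedelta(hours=2)),
    Deployment(base + timedelta(days=1), base + timedelta(days=1, hours=1)),
    Deployment(
        base + timedelta(days=2),
        base + timedelta(days=2, hours=5),
        failed=True,
        restored_at=base + timedelta(days=2, hours=5, minutes=40),
    ),
    Deployment(base + timedelta(days=3), base + timedelta(days=3, hours=3)),
    Deployment(base + timedelta(days=4), base + timedelta(days=4, minutes=30)),
]

for name, value in dora_summary(week, period_days=5).items():
    print(f"{name}: {value}")
```

The median is used for lead time so that one unusually slow change does not dominate the figure; that choice is a design decision of this sketch, not a DORA requirement.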
DORA PERFORMANCE BANDS (indicative; exact thresholds shift between annual reports):
┌──────────────────────┬────────────┬────────────┬────────────┬──────────┐
│ Metric │ Elite │ High │ Medium │ Low │
├──────────────────────┼────────────┼────────────┼────────────┼──────────┤
│ Deployment Frequency │ Multiple/ │ Daily– │ Weekly– │ < Monthly│
│ │ day │ weekly │ monthly │ │
├──────────────────────┼────────────┼────────────┼────────────┼──────────┤
│ Lead Time │ < 1 hour │ 1 day– │ 1 week– │ > 6 mos │
│ │ │ 1 week │ 1 month │ │
├──────────────────────┼────────────┼────────────┼────────────┼──────────┤
│ Change Failure Rate │ 0–5% │ 0–15% │ 16–30% │ 16–30% │
├──────────────────────┼────────────┼────────────┼────────────┼──────────┤
│ MTTR │ < 1 hour │ < 1 day │ 1 day– │ > 6 mos │
│ │ │ │ 1 week │ │
└──────────────────────┴────────────┴────────────┴────────────┴──────────┘
Part 8: The DevOps Toolchain¶
DevOps is culture-first — but culture needs tools to express itself. Here's the standard toolchain by category:
DEVOPS TOOLCHAIN MAP
PLAN CODE BUILD TEST
───────── ──────── ────── ──────
Jira Git GitHub Actions pytest
Linear GitHub GitLab CI Jest
Confluence GitLab Jenkins Selenium
Miro VS Code CircleCI k6 (perf)
Notion JetBrains Buildkite OWASP ZAP
RELEASE DEPLOY OPERATE MONITOR
────────── ────── ────────── ────────
Argo CD Kubernetes PagerDuty Prometheus
Spinnaker Helm Opsgenie Grafana
LaunchDarkly Terraform VictorOps Loki
(feature flags)Pulumi Runbooks Jaeger
Crossplane OpenTelemetry
Datadog
The golden rule of DevOps tooling: Tools serve culture. Adopt tools that enforce the practices you want — automated testing, infrastructure as code, observability. Don't adopt tools that exist to make bad processes faster.
Part 9: SRE — Google's Answer to "Who Runs It?"¶
Site Reliability Engineering (SRE) is Google's discipline for running production systems, started in 2003 by Ben Treynor Sloss and documented in Google's SRE Book (2016). It predates the term DevOps but is widely described as a concrete implementation of DevOps principles.
Where DevOps is a philosophy, SRE is an opinionated implementation with specific practices and roles.
The Key SRE Concepts¶
SLI — Service Level Indicator A quantitative measure of service behaviour:
SLI examples:
✦ Availability: % of requests returning 2xx in the last 30 days
✦ Latency: % of requests completing in < 200ms
✦ Throughput: requests processed per second
✦ Durability: % of stored data successfully retrieved
SLO — Service Level Objective The target value for an SLI:
SLO examples:
✦ 99.9% of requests return 2xx over a 30-day window
✦ 95% of requests complete in < 200ms
✦ 99.999% data durability
SLA — Service Level Agreement The contractual commitment to customers (usually less strict than the SLO):
SLA: "We guarantee 99.5% uptime"
SLO: "We target 99.9% uptime internally"
SLI: "Current measured uptime: 99.97%"
The gap between SLO and SLA is your safety buffer.
Error Budget The amount of unreliability allowed before the SLO is breached:
SLO: 99.9% availability over 30 days
Error Budget: 0.1% of 30 days = 43.2 minutes of allowable downtime
Error Budget Policy:
✦ Budget is plentiful → deploy freely, take risks, experiment
✦ Budget is running low → slow deployments, focus on reliability
✦ Budget is exhausted → feature freeze until reliability improves
The error budget creates a shared incentive between product and engineering: product wants fast features (which spend error budget), engineering wants reliability (which preserves error budget). They must negotiate using data, not politics.
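The error budget arithmetic is easy to check in code. For an exact 30-day window, 0.1% works out to 43.2 minutes; the figure of about 43.8 minutes that is also commonly quoted corresponds to an average calendar month of roughly 30.44 days. The sketch below is illustrative: the function names and the 25% "running low" threshold are assumptions, not part of any standard.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime within the window for an availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return window * (1.0 - slo_target)


def budget_policy(
    budget: timedelta,
    downtime_so_far: timedelta,
    low_water_mark: float = 0.25,  # assumed threshold for "running low"
) -> str:
    """Map the remaining budget to the three policy states described above."""
    remaining = 1.0 - downtime_so_far / budget
    if remaining <= 0.0:
        return "feature freeze"
    if remaining < low_water_mark:
        return "slow down"
    return "deploy freely"


budget = error_budget(0.999, timedelta(days=30))
print(f"budget: {budget.total_seconds() / 60:.1f} minutes")  # 43.2 minutes
print(budget_policy(budget, timedelta(minutes=10)))  # deploy freely
print(budget_policy(budget, timedelta(minutes=40)))  # slow down
print(budget_policy(budget, timedelta(minutes=50)))  # feature freeze
```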
Toil Reduction — The SRE Mission¶
Google's SRE Book defines a specific goal: SRE teams should spend no more than 50% of their time on toil. The other 50% must go to engineering projects that reduce future toil.
THE TOIL → AUTOMATION CYCLE
Identify toil (manual, repetitive work)
↓
Estimate time cost per week
↓
Engineer automation solution
↓
Toil is eliminated
↓
Time freed up for more automation
↓ (and repeat)
This is how SRE teams achieve more reliability with fewer people over time — not through heroics, but through systematic toil elimination.
Part 10: DevOps Extensions — The Family Grows¶
DevOps spawned several extensions as its principles were applied to adjacent domains:
DevSecOps — Security Shifts Left¶
TRADITIONAL SECURITY:
Design → Code → Build → Test → Stage → Deploy → SECURITY REVIEW → Production
↑
Security is the last gate.
Vulnerabilities found here
cost far more to fix.
DEVSECOPS:
[Security from day 1]
Design → Code → Build → Test → Stage → Deploy → Production
↑ ↑ ↑ ↑ ↑ ↑
Threat Secret SAST DAST Pen Runtime
model scan scan scan test protection
DevSecOps integrates security practices at every stage of the DevOps lifecycle:
| Stage | Security Practice |
|---|---|
| Plan | Threat modelling, security requirements |
| Code | SAST (Semgrep, SonarQube), secret detection (detect-secrets) |
| Build | Dependency scanning (OWASP Dependency-Check, Snyk), container scanning (Trivy) |
| Test | DAST (OWASP ZAP), penetration testing |
| Deploy | Policy as code (OPA, Kyverno), image signing |
| Operate | Runtime security (Falco), WAF |
| Monitor | Security information and event management (SIEM) |
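To give a feel for what secret detection at the Code stage does, here is a deliberately tiny scanner. It is a sketch, not how detect-secrets or any other listed tool is implemented. The two patterns (the `AKIA` prefix of AWS access key IDs and a PEM private key header) are only examples, and real scanners use many more rules plus entropy checks.

```python
import re

# Illustrative patterns only; production scanners ship far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
sample = 'region = "eu-west-1"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
print(scan_text(sample))  # [(2, 'aws_access_key_id')]
```

Run as a pre-commit hook, a check like this fails the commit before the secret ever reaches the repository, which is the point of shifting security left.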
FinOps — Cloud Cost as Engineering Practice¶
THE FINOPS PROBLEM:
Cloud enables any engineer to provision any resource, instantly.
Without visibility and accountability, cloud bills spiral.
"We're not sure what's costing so much" → $500K surprise bill
FinOps is the practice of bringing financial accountability to the variable-spend cloud model:
FINOPS LIFECYCLE
INFORM OPTIMISE OPERATE
───────── ──────── ───────
Visibility: Right-sizing: Tagging policy:
Know what you Match resource Every resource tagged
spend and why to actual need by team, service, env
Cost allocation: Reserved capacity: Budgets and alerts:
Costs attributed Commit for Alert before
to teams/services discounts overspend, not after
Anomaly detection: Auto-scaling: Chargebacks:
Spot unusual Scale down when Teams see their
cost spikes fast load drops cloud bill
Key FinOps metrics:
✦ Cost per deploy — is shipping getting cheaper or more expensive?
✦ Cost per customer — unit economics at cloud scale
✦ Rightsizing score — what % of resources are appropriately sized?
✦ Reserved coverage — what % of baseline is on committed pricing?
✦ Waste ratio — idle or unused resources as % of total spend
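Several of these metrics reduce to simple ratios over tagged resources. The sketch below assumes a hypothetical `Resource` record and made-up costs. It treats "idle" as average utilisation below 5% and measures reserved coverage as a share of total spend, both of which are simplifying assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    team: str                # owner, taken from the resource's tags
    monthly_cost: float      # spend attributed to this resource
    avg_utilisation: float   # 0.0 to 1.0, e.g. average CPU utilisation
    committed: bool          # covered by reserved or committed pricing?


def waste_ratio(resources: list[Resource], idle_below: float = 0.05) -> float:
    """Spend on idle resources as a fraction of total spend."""
    total = sum(r.monthly_cost for r in resources)
    idle = sum(r.monthly_cost for r in resources if r.avg_utilisation < idle_below)
    return idle / total if total else 0.0


def reserved_coverage(resources: list[Resource]) -> float:
    """Spend on committed pricing as a fraction of total spend."""
    total = sum(r.monthly_cost for r in resources)
    committed = sum(r.monthly_cost for r in resources if r.committed)
    return committed / total if total else 0.0


def cost_by_team(resources: list[Resource]) -> dict[str, float]:
    """Cost allocation: attribute spend to the owning team."""
    totals: dict[str, float] = defaultdict(float)
    for r in resources:
        totals[r.team] += r.monthly_cost
    return dict(totals)


# Hypothetical fleet, for illustration only.
fleet = [
    Resource("checkout", 400.0, 0.60, committed=True),
    Resource("checkout", 100.0, 0.02, committed=False),
    Resource("search", 300.0, 0.45, committed=True),
    Resource("search", 200.0, 0.30, committed=False),
]

print(f"waste ratio:       {waste_ratio(fleet):.0%}")
print(f"reserved coverage: {reserved_coverage(fleet):.0%}")
print(f"cost by team:      {cost_by_team(fleet)}")
```

None of this works without the tagging policy: untagged resources cannot be attributed to a team, so they never show up in anyone's numbers.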
GitOps — Infrastructure as Git¶
GITOPS PRINCIPLE:
Git is the single source of truth for:
✦ Application code
✦ Infrastructure configuration (Terraform)
✦ Kubernetes manifests (Helm, Kustomize)
✦ CI/CD pipeline definitions
All changes go through git (PR → review → merge).
No manual console clicks in production. Ever.
Argo CD / Flux watches git → applies changes to cluster automatically.
If someone manually changes something in the cluster → auto-reverted.
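The watch-and-apply behaviour is a reconciliation loop: compare the state declared in git with the state observed in the cluster and compute what must change. The sketch below illustrates the idea with plain dictionaries. It is not how Argo CD or Flux is implemented, and the function name and resource shapes are hypothetical.

```python
def plan_changes(
    desired: dict[str, dict], actual: dict[str, dict]
) -> dict[str, list[str]]:
    """Diff desired state (from git) against actual state (from the cluster)."""
    return {
        # Declared in git but missing from the cluster.
        "create": sorted(desired.keys() - actual.keys()),
        # Present in both but drifted, e.g. after a manual change.
        "update": sorted(
            name
            for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        # Running in the cluster but not declared in git.
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Hypothetical states, for illustration only.
desired_state = {
    "web": {"image": "web:1.4", "replicas": 3},
    "worker": {"image": "worker:2.0", "replicas": 2},
}
actual_state = {
    "web": {"image": "web:1.4", "replicas": 5},      # someone scaled it by hand
    "debug-pod": {"image": "busybox", "replicas": 1},  # created outside git
}

print(plan_changes(desired_state, actual_state))
```

Because git always wins, the hand-scaled `web` deployment is put back to 3 replicas and the undeclared `debug-pod` is removed, which is the auto-revert behaviour described above.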
MLOps — DevOps for Machine Learning¶
THE ML PROBLEM:
Data science teams build models in Jupyter notebooks.
Models stay in notebooks. Never reach production.
When they do reach production, nobody can reproduce them.
When data drifts, nobody knows the model is degrading.
MLOPS PRACTICES:
✦ Experiment tracking (MLflow, Weights & Biases)
✦ Model versioning (DVC, Hugging Face)
✦ Automated retraining pipelines
✦ Model monitoring (data drift, performance decay)
✦ Feature stores (Feast, Tecton)
✦ A/B testing for model versions
Platform Engineering — DevOps at Scale¶
When DevOps practices must scale across 50+ teams, Platform Engineering emerges as the discipline:
THE PLATFORM ENGINEERING MODEL
Platform Team builds:
┌─────────────────────────────────────────────────────────────┐
│ INTERNAL DEVELOPER PLATFORM (IDP) │
│ │
│ Self-service deployment Standardised observability │
│ Automated provisioning Golden path CI/CD templates │
│ Internal service catalog Secrets management │
└─────────────────────────────────────────────────────────────┘
│ │
▼ ▼
Product Team A Product Team B
(consumes platform, (consumes platform,
focuses on product) focuses on product)
"Pave the golden path, don't mandate the only path."
Platform Engineering is the answer to "how do we scale DevOps across dozens of teams?" Instead of each team reinventing CI/CD, observability, and deployment practices, the Platform team builds shared infrastructure that makes the right way the easy way.
Part 11: Real-World DevOps — How It Works in Practice¶
Amazon: "You Build It, You Run It"¶
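Amazon's DevOps transformation is one of the most documented in the industry. In 2001, Amazon's retail site was a large monolith, and the pain of changing and deploying it prompted a massive re-architecture. The often-quoted figure of a production deployment every 11.6 seconds on average describes Amazon roughly a decade later, not the monolith.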
By 2011, Amazon had:
- Decomposed into hundreds of microservices (each owned by a two-pizza team)
- Made each team fully responsible for development, deployment, and operations
- Reached 23,000 deployments per day across all services
The key policy: "You build it, you run it." If your service pages you at 3am, you fix it. This creates a powerful incentive to build well-monitored, easily-debugged, resilient services.
Netflix: Chaos Engineering¶
Netflix's DevOps philosophy extended to deliberately breaking their own systems. Chaos Engineering — the practice of injecting failures into production to verify resilience — was pioneered by Netflix's Chaos Monkey (2011).
CHAOS MONKEY: randomly terminates production instances
CHAOS GORILLA: terminates an entire AWS Availability Zone
CHAOS KONG: simulates an entire AWS region failure
The philosophy: if you don't test for failure, you'll be surprised by it.
If you regularly test for failure, you build systems that survive it.
Netflix deploys hundreds of times per day. They achieve this with:
- Comprehensive automated testing
- Circuit breakers and fallbacks built into every service
- Active chaos engineering to verify resilience assumptions
- A blameless culture where engineers who catch problems are celebrated
ING Bank: Enterprise DevOps¶
Not just tech companies can do DevOps. ING, the Dutch bank, is one of the most cited enterprise DevOps transformations.
In 2015, ING reorganised from traditional functional silos into "squads" — small, cross-functional teams (like Spotify's model) that owned specific product areas end-to-end, including operations.
Results after 3 years:
- Deployment frequency increased from monthly to multiple times per day
- Time to market for new features reduced by 70%
- Application availability improved from 98.5% to 99.9%
- Employee engagement scores rose (teams preferred the new model)
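The availability figures are easier to appreciate as downtime. The short calculation below converts an availability percentage into hours of downtime per year. It assumes a 365-day year and continuous (24/7) service, and the function name is introduced here only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours per year a service is unavailable at the given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


before = downtime_hours_per_year(98.5)
after = downtime_hours_per_year(99.9)
print(f"98.5% -> {before:.1f} h/year, 99.9% -> {after:.1f} h/year")
```

Under these assumptions, 98.5% allows roughly 131 hours of downtime per year and 99.9% allows just under 9 hours, about a fifteenfold reduction.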
The lesson: DevOps is not only for Silicon Valley startups. Regulated industries — banks, healthcare, insurance — can and do adopt DevOps successfully.
Part 12: How to Start a DevOps Transformation
Many DevOps transformations fail, not because the ideas are wrong but because of how they are implemented. These are the most common mistakes:
Common Failure Patterns
❌ FAILURE: "We hired a DevOps team"
Why it fails: Creates a new silo. Other teams still throw code over a wall.
Fix: Embed DevOps practices in every team. Platform team enables, not executes.
❌ FAILURE: "We bought the tools (Jenkins, Docker, Kubernetes)"
Why it fails: Tools without culture produce automated bad processes.
Fix: Culture and process first. Tools should enforce the culture.
❌ FAILURE: "We told everyone they're doing DevOps now"
Why it fails: Decree without support doesn't change behaviour.
Fix: Start with a willing pilot team. Show results. Expand with proof.
❌ FAILURE: "We did a big-bang transformation"
Why it fails: Big changes create resistance and backslide.
Fix: Small, continuous improvements. One practice at a time.
❌ FAILURE: "We skipped the culture work"
Why it fails: Without psychological safety, people won't risk new practices.
Fix: Blameless post-mortems. Celebrate learning. Reward transparency.
The Three-Step Transformation Approach
Step 1: Value Stream Mapping (weeks 1-4)
Before changing anything, understand your current state. Map every step from "developer commits code" to "user gets value":
VALUE STREAM MAP EXERCISE:
Ask for each step:
✦ How long does this step take?
✦ What's the wait time before this step starts?
✦ What's the error/rework rate at this step?
✦ Who does this? Can it be automated?
Example findings:
Code review: 2 days wait, 4 hours work
QA testing: 3 days wait, 1 day work
Change board: 5 days wait, 1 hour work
Deployment: 1 week wait, 4 hours work (manual, error-prone)
Total lead time: 3 weeks
Total value-add time: 2 days
Waste: roughly 90% (time spent waiting, not working)
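The arithmetic behind a value stream map is simple enough to script. The sketch below uses the example findings above and assumes an 8-hour working day and a 5-day working week. Both are assumptions, and the exact percentages shift if you count calendar time instead. The `Step` type and `summarize` function are illustrative names.

```python
from dataclasses import dataclass
from typing import Dict, Sequence

HOURS_PER_DAY = 8   # assumption: working hours per day
DAYS_PER_WEEK = 5   # assumption: working days per week


@dataclass(frozen=True)
class Step:
    name: str
    wait_hours: float  # time the work item sits idle before the step starts
    work_hours: float  # hands-on (value-add) time in the step


def summarize(steps: Sequence[Step]) -> Dict[str, float]:
    """Total lead time, value-add time, and share of lead time spent waiting."""
    wait = sum(s.wait_hours for s in steps)
    work = sum(s.work_hours for s in steps)
    lead = wait + work
    return {
        "lead_time_hours": lead,
        "value_add_hours": work,
        "waste_pct": 100 * wait / lead,
    }


D = HOURS_PER_DAY
steps = [
    Step("Code review", wait_hours=2 * D, work_hours=4),
    Step("QA testing", wait_hours=3 * D, work_hours=1 * D),
    Step("Change board", wait_hours=5 * D, work_hours=1),
    Step("Deployment", wait_hours=DAYS_PER_WEEK * D, work_hours=4),
]

summary = summarize(steps)
print(f"Lead time:  {summary['lead_time_hours'] / D:.1f} working days")
print(f"Value-add:  {summary['value_add_hours'] / D:.1f} working days")
print(f"Waste:      {summary['waste_pct']:.0f}%")
```

With these assumptions the lead time comes to about 17 working days, of which about 2 are hands-on work, so nearly nine-tenths of the elapsed time is waiting. The step with the largest `wait_hours` is usually the first candidate for automation or removal.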
Step 2: Pilot Team (months 1-3)
Choose one willing team with a meaningful product. Apply core practices:
- Trunk-based development
- CI/CD pipeline (automated build, test, deploy to staging)
- Structured logging and basic monitoring
- Retrospectives every sprint
Measure DORA metrics before and after. Document the results.
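One way to take that before-and-after measurement is to compute the four metrics from a list of deployment records. The sketch below is a minimal illustration: the `Deployment` record and its fields are hypothetical, and in practice the data would come from your CI/CD system and incident tracker. It reports the median lead time and the mean time to restore. Other aggregation choices are equally defensible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def dora_metrics(deploys: Sequence[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a reporting period."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deploys) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    sample = [
        Deployment(t0, t0 + timedelta(hours=2)),
        Deployment(t0, t0 + timedelta(hours=4), failed=True,
                   restored_at=t0 + timedelta(hours=5)),
        Deployment(t0, t0 + timedelta(hours=6)),
    ]
    for name, value in dora_metrics(sample, period_days=3).items():
        print(f"{name}: {value}")
```

Running the same function on records from before and after the pilot gives a like-for-like comparison to show the next team.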
Step 3: Expand with Evidence (months 3+)
Use the pilot results to convince the next team. Then the next. Build a Platform team to codify what worked into shared infrastructure. The Platform team's job is to make the right practices the easy path for all teams.
THE EXPANSION ROADMAP
Month 1-3: Pilot team + CI/CD + monitoring
Month 3-6: Platform team formed, golden path CI/CD template
Month 6-9: 3-5 teams on platform
Month 9-12: GitOps, self-service deployment, SLOs
Year 2: All teams on platform, DORA metrics tracked org-wide
Year 3: DevSecOps, FinOps, Platform Engineering mature
The DevOps Mindset Shift — A Summary
DevOps is ultimately a set of mindset shifts. Here's the before-and-after:
| Before DevOps | After DevOps |
|---|---|
| "Not my problem" (silos) | "Our system, our problem" (shared ownership) |
| Blame the person | Fix the system |
| Change is risk | Small change is low risk |
| Deploy monthly (safety) | Deploy daily (actually safer) |
| "It works on my machine" | "It works, or it doesn't" (same environment) |
| Operations controls release | Anyone can release at any time |
| Security is the last gate | Security at every stage |
| We measure stories shipped | We measure value delivered |
| Long feedback loops | Fast feedback everywhere |
| Knowledge in people's heads | Knowledge in documented systems |
Summary
WHAT: Culture + Practices + Tools that enable continuous value delivery
WHY: Eliminate the wall of confusion between Dev and Ops.
Ship faster, recover faster, learn faster.
THE THREE WAYS:
1. Flow — remove obstacles between code and customer
2. Feedback — fast loops from production back to development
3. Learning — continuous improvement, never stop experimenting
CALMS PILLARS:
Culture → Automation → Lean → Measurement → Sharing
THE LOOP:
Plan → Code → Build → Test → Release → Deploy → Operate → Monitor → (repeat)
CI: Integrate daily, automated build + test on every commit
CD (Continuous Delivery): Always deployable; production deployment is a business decision
CD (Continuous Deployment): Every passing change auto-deploys (requires high automation maturity)
DORA METRICS (measure these, improve these):
✦ Deployment Frequency (elite: multiple/day)
✦ Lead Time for Changes (elite: < 1 hour)
✦ Change Failure Rate (elite: < 5%)
✦ Mean Time to Restore (elite: < 1 hour)
EXTENSIONS:
DevSecOps — security at every stage
FinOps — cloud cost as engineering discipline
GitOps — git as the single source of truth
MLOps — DevOps for machine learning
Platform Engineering — DevOps at scale (beyond 10 teams)
HOW TO START:
1. Map your value stream (find the waste)
2. Run a pilot team (prove the practices)
3. Build a platform (scale what works)
Essential reading: The Phoenix Project and The DevOps Handbook by Gene Kim et al., Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim, Site Reliability Engineering by Google, and Team Topologies by Matthew Skelton & Manuel Pais. Together, these five books form the core of the DevOps canon.