DevOps is the practice of unifying software Development and IT Operations to shorten delivery cycles, improve reliability, and eliminate toil through automation.
This page covers the full DevOps lifecycle from code commit to production incident recovery.
Prerequisites: Linux Advanced for OS internals, Shell Script for automation glue, System Design for architecture context.
History
How DevOps Was Born
In the early 2000s, development and operations were completely separate teams. Devs wrote code and threw it “over the wall” to Ops to deploy. Ops blamed Dev for broken releases. Dev blamed Ops for slow deployments. Both were right — the process was broken.
The name grew out of conversations about “Agile Infrastructure” between Patrick Debois and Andrew Clay Shafer at the 2008 Agile Conference. Debois coined the term DevOps in 2009 when he organized the first DevOpsDays conference in Ghent, Belgium.
Flickr’s famous 2009 talk “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr” by John Allspaw and Paul Hammond showed the world that shipping fast and being reliable were NOT opposites — they enabled each other.
The Lean Manufacturing movement (Toyota Production System) heavily influenced DevOps — eliminating waste, continuous improvement (Kaizen), and just-in-time delivery translated directly to software.
Who Built It
Patrick Debois — coined DevOps, organized first DevOpsDays.
Gene Kim — wrote The Phoenix Project (2013), the novel that explained DevOps to a generation of engineers. Also The DevOps Handbook and Accelerate.
Jez Humble & Dave Farley — wrote Continuous Delivery (2010), the foundational book for CI/CD pipelines.
Nicole Forsgren — lead researcher behind DORA metrics, co-author of Accelerate — the first scientific study proving DevOps practices improve software delivery performance.
Google SRE team — popularized Site Reliability Engineering as the “DevOps implementation” at hyperscale. Their SRE books are free online.
Timeline
timeline
title DevOps Evolution
2009 : First DevOpsDays — Ghent, Belgium
: "10+ Deploys Per Day" (Flickr talk)
2010 : Continuous Delivery book — Humble & Farley
: Chef and Puppet gain popularity
2011 : Amazon EC2, S3 normalize cloud infrastructure
: Vagrant released — dev/prod parity
2013 : Docker released — container revolution begins
: The Phoenix Project book published
2014 : Kubernetes announced by Google
: Terraform 0.1 released
: Ansible 1.0 gains traction
2015 : Kubernetes 1.0 — donated to CNCF
: DORA research program begins
2016 : GitLab CI becomes mainstream
: Prometheus joins CNCF
2017 : ArgoCD released
2018 : GitHub Actions announced
: DevSecOps becomes mainstream
: Accelerate book — DORA metrics proven
2020 : GitOps movement formalizes
: Platform Engineering emerges
2022 : AI-assisted DevOps — Copilot in pipelines
2024 : Platform Engineering + AI Ops standard practice
Why DevOps Matters
Before DevOps: deployments were monthly events with full weekend downtime, war rooms, rollback plans, and post-mortems that blamed people.
After DevOps: Elite organizations deploy multiple times per day with automated pipelines, feature flags, canary releases, and automatic rollbacks — no war rooms, no weekend sacrifices.
DORA’s State of DevOps reports have found elite performers deploying up to 182× more frequently and recovering from incidents up to 2,604× faster than low performers; the exact multipliers vary by report year.
Introduction
DevOps is a culture, set of practices, and toolchain that eliminates the wall between the people who build software and the people who run it.
The core insight is simple but powerful: the people who write the code should also care about whether it runs in production. When developers own reliability, they build more reliable software. When operations teams understand code, they build better infrastructure.
Platform Engineering: self-service infra, golden paths, internal platforms.
In a small company, one person does all three. At Google, they are distinct disciplines with thousands of engineers each.
CI/CD — Continuous Integration & Delivery
What is CI/CD — The Real Explanation
Continuous Integration (CI) means every developer merges their code into a shared branch at least once per day, and every merge automatically triggers a build, lint check, and test suite. The goal is to catch integration problems immediately — not weeks later when a big release happens.
Before CI: developers worked in long-lived branches for weeks, then merged everything at once. The result was called “integration hell” — conflicts everywhere, tests failing for mysterious reasons, nobody sure what broke what.
Continuous Delivery (CD) extends CI by ensuring that every successful build is automatically deployable to any environment — staging, pre-prod, or production. You choose when to deploy; the system ensures it’s always ready.
Continuous Deployment (sometimes confused with Delivery) goes one step further — every successful pipeline run automatically deploys to production with no human approval gate.
The key rule: fail fast. If linting fails, don’t bother building. If unit tests fail, don’t bother running security scans. Every minute of CI time costs money and developer patience.
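The fail-fast rule can be sketched as a loop that stops at the first failing stage. This is a minimal illustration, not how any particular CI server is implemented; the stage names and the placeholder commands in `STAGES` are assumptions, so swap in the project's real lint, test, and build commands.

```python
import subprocess
import sys

# Hypothetical stages, cheapest first. Each command here is a placeholder
# that just prints a message; replace with real lint/test/build commands.
STAGES = [
    ("lint", [sys.executable, "-c", "print('lint ok')"]),
    ("unit-tests", [sys.executable, "-c", "print('tests ok')"]),
    ("build", [sys.executable, "-c", "print('build ok')"]),
]


def run_pipeline(stages):
    """Run stages in order and return the first failing stage name, or None."""
    for name, command in stages:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Fail fast: later (more expensive) stages never start.
            print(f"stage {name!r} failed; skipping remaining stages")
            return name
    return None
```

Ordering the stages from cheapest to most expensive is what makes the early exit save time.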
Deployment Strategies — Choosing the Right One
How you release to production determines your risk, rollback speed, and downtime. Each strategy is a tradeoff:
| Strategy | How It Works | Downtime | Rollback | Risk | Use When |
| --- | --- | --- | --- | --- | --- |
| Recreate | Stop v1, start v2 | Yes | Slow (redeploy) | High | Dev/test only |
| Rolling | Replace pods one by one | None | Medium (rollout undo) | Medium | Standard apps |
| Blue-Green | Two identical envs, swap traffic | None | Instant (swap back) | Low | Critical services |
| Canary | Route 5% traffic to new version, watch metrics | None | Instant | Very Low | High-risk changes |
| A/B Testing | Route by user segment (geography, plan tier) | None | Instant | Low | Feature experiments |
| Feature Flags | Deploy dark, toggle per user/region in runtime | None | Instant | Minimal | Any change |
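The "watch metrics" step of a canary release can be sketched as a comparison between the canary's error rate and the baseline's. The function name, the thresholds (`max_ratio`, `min_requests`) and the 0.1% floor are illustrative assumptions, not values from any specific tool.

```python
def canary_verdict(baseline_errors, baseline_total, canary_errors, canary_total,
                   max_ratio=2.0, min_requests=100):
    """Return 'wait', 'promote', or 'rollback' for a canary (illustrative thresholds)."""
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Tolerate up to max_ratio times the baseline error rate, with a small
    # absolute floor so a near-zero baseline does not fail on a single error.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "rollback" if canary_rate > allowed else "promote"
```

Real canary analysis usually looks at latency and saturation as well as errors; this sketch only covers the error-rate comparison.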
Feature flags (LaunchDarkly, Flagsmith, Unleash) are the most powerful pattern — they decouple deployment from release. You ship code to 100% of servers but only enable it for 1% of users. No pipeline changes needed to roll back — just toggle a flag.
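One common way to enable a flag for a fixed percentage of users is to hash the user ID into a stable bucket. The sketch below assumes that approach; it is not the algorithm of LaunchDarkly, Flagsmith, or Unleash, and the names `bucket` and `is_enabled` are hypothetical.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """A user sees the feature when their bucket falls below the rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and user ID, a user who has the feature at 1% keeps it when the rollout widens to 10%, and setting the percentage to 0 turns the feature off for everyone without a deploy.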
CI/CD Tools
GitHub Actions — workflow-as-code built directly into GitHub. Uses YAML trigger definitions, reusable actions from the marketplace, and matrix builds for parallel jobs. Best for projects hosted on GitHub.
GitLab CI — deep integration with GitLab repos. YAML-based pipeline DSL with shared runners, environments, auto-DevOps, and built-in container registry. Strong choice for self-hosted enterprise.
Jenkins, which began as Hudson in 2004, is one of the oldest CI servers still in wide use. Groovy-based Jenkinsfile, 1800+ plugins, runs anywhere. Powerful but operationally heavy, since you manage the server yourself. Best when you need maximum flexibility.
CircleCI is cloud-native, fast, and Docker-first. Orb reuse system for shared config. Strong caching and parallelism.
Travis CI — simple, hosted, great for open-source. Less popular after 2021 pricing changes.
Bamboo — Atlassian’s CI tool. Tight Jira integration, build plans with stages, branching.
TeamCity — JetBrains CI. Kotlin DSL for pipelines, smart build chains, excellent for JVM projects.
ArgoCD — not a traditional CI tool — it’s a GitOps CD controller for Kubernetes. Git is the source of truth; ArgoCD continuously reconciles the cluster to match it. More on this in the GitOps section.
Git is the foundation of every DevOps pipeline. Your branching strategy determines how fast you can ship and how many merge conflicts you fight. No perfect strategy exists — pick the one that matches your team size and release cadence.
Branching Strategies Compared
| Strategy | Branches | Best For | Risk |
| --- | --- | --- | --- |
| Trunk-Based Development | Single main, short-lived feature branches (<1 day) | High-frequency deployment, feature flags | Low merge conflicts, needs strong CI |
| Git Flow | main + develop + feature/* + release/* + hotfix/* | Scheduled releases, versioned products | Complex, many long-lived branches |
| GitHub Flow | main + feature branches | Web apps with continuous deployment | Simple, no staging branch |
| GitLab Flow | main + production + staging environment branches | Multiple environments | Medium complexity |
Industry trend: modern high-performing teams (Google, Meta, Netflix) use Trunk-Based Development with feature flags. Git Flow is still common in firmware, libraries, and enterprise software with versioned releases.
Semantic Versioning (SemVer)
MAJOR.MINOR.PATCH → 2.14.3
│     │     └── Bug fixes: no API changes
│     └── New features: backward compatible
└── Breaking changes: not backward compatible
Pre-release tags:
2.14.3-alpha.1 → early unstable
2.14.3-beta.2 → feature complete, buggy
2.14.3-rc.1 → release candidate, near stable
Conventional Commits (auto-generates changelog):
feat: add user authentication → bumps MINOR
fix: correct null pointer crash → bumps PATCH
feat!: redesign API endpoints → bumps MAJOR (! = breaking)
chore: update dependencies → no version bump
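The mapping from Conventional Commits to a SemVer bump can be sketched as follows. This is a simplified reading of the rules listed above (it ignores pre-release tags), and the helper names `bump_kind` and `next_version` are hypothetical rather than taken from any release tool.

```python
import re

# type, optional (scope), optional "!" marking a breaking change, then ": subject"
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<breaking>!)?: .+")


def bump_kind(message: str) -> str:
    """Classify one commit message as 'major', 'minor', 'patch', or 'none'."""
    first_line = message.splitlines()[0] if message.strip() else ""
    match = HEADER.match(first_line)
    if not match:
        return "none"
    if match["breaking"] or "BREAKING CHANGE:" in message:
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match["type"], "none")


def next_version(current: str, messages: list[str]) -> str:
    """Apply the largest bump found in the commit messages to MAJOR.MINOR.PATCH."""
    major, minor, patch = (int(part) for part in current.split("."))
    kinds = {bump_kind(message) for message in messages}
    if "major" in kinds:
        return f"{major + 1}.0.0"
    if "minor" in kinds:
        return f"{major}.{minor + 1}.0"
    if "patch" in kinds:
        return f"{major}.{minor}.{patch + 1}"
    return current
```

With the examples above, `next_version("2.14.3", ["fix: correct null pointer crash"])` gives `2.14.4`, and adding a `feat!:` commit to the list gives `3.0.0`.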
Infrastructure as Code (IaC)
What is IaC — The Real Explanation
Before IaC, servers were configured manually — a sysadmin would SSH in, install packages, edit config files, and write nothing down. This created snowflake servers: each one slightly different, impossible to reproduce, fragile to touch. “Works on my server” was a real problem.
IaC treats infrastructure — servers, networks, load balancers, databases, DNS records — as source code. You write a file describing what you want, commit it to Git, and a tool makes reality match the file. The result is:
Reproducibility — spin up identical environments in minutes
Auditability — Git history shows who changed what and why
Automation — no manual steps, no forgotten config, no snowflakes
Disaster Recovery — losing a server means running terraform apply, not a day of manual work
There are two main layers of IaC: Provisioning (create the servers) and Configuration Management (configure what’s on them).
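The idea that "a tool makes reality match the file" can be sketched as a diff between desired and actual state. This is a toy model of what a plan step computes, not how Terraform is implemented; the resource names and attributes in the usage note are made up.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare desired resources with actual ones and list the changes needed."""
    create = sorted(name for name in desired if name not in actual)
    delete = sorted(name for name in actual if name not in desired)
    update = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": create, "update": update, "delete": delete}
```

For example, with `desired = {"vpc": {...}, "web": {"type": "t3.medium"}}` and `actual = {"web": {"type": "t3.small"}, "old-db": {...}}`, the plan is to create `vpc`, update `web`, and delete `old-db`. Running it again after applying those changes yields an empty plan, which is the property that makes IaC repeatable.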
Terraform is the industry standard for cloud provisioning. It works by talking to cloud provider APIs (AWS, GCP, Azure, etc.) to create, update, and destroy resources.
The key concept is state — Terraform keeps a .tfstate file that maps your config to real cloud resources. Always store state remotely (S3 + DynamoDB, Terraform Cloud) and never commit it to Git (it contains secrets).
Terraform core workflow
terraform init                    # download providers + initialize backend
terraform fmt                     # auto-format all .tf files (run in CI)
terraform validate                # check syntax + provider schema
terraform plan                    # show exactly what will change; ALWAYS read this
terraform apply                   # apply changes (asks for approval)
terraform apply -auto-approve     # skip approval (CI/CD only)
terraform destroy                 # tear down all managed resources
terraform state list              # show all tracked resources
terraform state show aws_instance.web   # inspect a specific resource
terraform output                  # print output values
terraform import aws_s3_bucket.my_bucket my-existing-bucket   # import existing resource
main.tf — production-ready AWS setup
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # pin major version, allow minor updates
    }
  }

  # Remote state: critical for teams
  backend "s3" {
    bucket         = "my-company-tfstate"
    key            = "prod/myapp/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # prevents concurrent applies
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = "myapp"
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}

# Variables: parameterize everything
variable "environment" {
  type = string
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "instance_type" {
  type    = string
  default = "t3.medium"
}

# Not shown here and assumed to be defined elsewhere in the module:
# data.aws_availability_zones.available, data.aws_ami.ubuntu,
# aws_security_group.web, aws_iam_instance_profile.web

# VPC: isolated network
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags                 = { Name = "${var.environment}-vpc" }
}

# Subnets: public for web, private for app/db
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  tags                    = { Name = "${var.environment}-public-${count.index}" }
}

# EC2 instance
resource "aws_instance" "web" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = var.instance_type
  subnet_id              = aws_subnet.public[0].id
  vpc_security_group_ids = [aws_security_group.web.id]
  iam_instance_profile   = aws_iam_instance_profile.web.name

  user_data = base64encode(templatefile("scripts/user_data.sh", {
    environment = var.environment
  }))

  tags = { Name = "${var.environment}-web" }
}

# Outputs: use in other modules or CI scripts
output "instance_ip" { value = aws_instance.web.public_ip }
output "vpc_id" { value = aws_vpc.main.id }
Ansible is agentless — it connects via SSH and runs tasks on remote servers. Nothing to install on the target. Configuration is YAML playbooks. See Ansible for the full reference page.
Ansible is best for: configuring servers after Terraform creates them, rolling deployments, ad-hoc operations across a fleet.
# Inventory check
ansible-inventory -i inventory.ini --list     # show all hosts as JSON
ansible-inventory -i inventory.ini --graph    # show host tree

# Connectivity test
ansible all -m ping -i inventory.ini

# Ad-hoc commands: quick operations without a playbook
ansible webservers -m shell -a "systemctl status nginx" -i inventory.ini
ansible all -m apt -a "name=curl state=present" -b -i inventory.ini

# Run playbook
ansible-playbook playbook.yml -i inventory.ini
ansible-playbook playbook.yml -i inventory.ini --check          # dry-run
ansible-playbook playbook.yml -i inventory.ini --diff           # show file diffs
ansible-playbook playbook.yml -i inventory.ini --tags deploy    # only deploy tasks
ansible-playbook playbook.yml -i inventory.ini --limit web1     # only one host

# Vault: encrypt secrets
ansible-vault encrypt secrets.yml
ansible-vault decrypt secrets.yml
ansible-vault edit secrets.yml
ansible-playbook playbook.yml --ask-vault-pass
ansible-playbook playbook.yml --vault-password-file ~/.vault_pass
Containers & Orchestration
What is a Container — The Real Explanation
A container is not a virtual machine. A VM emulates entire hardware — CPU, RAM, disk, network card — with its own kernel. It’s heavy (GBs of RAM, minutes to boot).
A container is just an isolated process on the host OS, using Linux kernel features to fake isolation:
Namespaces — process sees its own PID list, network stack, filesystem mount points (covered deeply in Linux Advanced)
cgroups — limits how much CPU, RAM, and disk I/O the process can use
OverlayFS — layered filesystem so containers share base image layers (saves disk space)
The result: containers start in milliseconds, use MBs of RAM overhead, and thousands can run on one machine. The tradeoff is they share the host kernel — a kernel exploit affects all containers.
Docker Deep Reference
Docker packages your app + its dependencies into a portable image that runs identically on any machine with Docker installed. “Works on my machine” finally ends.
Docker essential commands
# ── Images ─────────────────────────────────────────────────────
docker build -t myapp:1.0 .                       # build from Dockerfile in current dir
docker build -t myapp:1.0 -f prod.Dockerfile .    # specify Dockerfile
docker images                                     # list local images
docker pull ubuntu:22.04                          # download image from registry
docker push registry.company.com/myapp:1.0
docker image inspect myapp:1.0                    # full image metadata
docker image prune -a                             # remove all unused images

# ── Containers ─────────────────────────────────────────────────
docker run -d -p 8080:80 --name web myapp:1.0
#          │  │          │          └── image
#          │  │          └── name the container
#          │  └── host_port:container_port
#          └── detached (background)
docker run --rm -it myapp:1.0 bash                # interactive, auto-delete on exit
docker ps -a                                      # all containers (running + stopped)
docker stop web                                   # graceful stop (SIGTERM)
docker kill web                                   # immediate stop (SIGKILL)
docker rm web                                     # delete stopped container
docker logs -f --tail=100 web                     # tail logs
docker exec -it web bash                          # shell inside running container
docker stats                                      # live CPU/memory usage

# ── Volumes & Networks ─────────────────────────────────────────
docker volume create mydata
docker run -v mydata:/app/data myapp:1.0          # named volume (persistent)
docker run -v $(pwd)/data:/app/data myapp:1.0     # bind mount (host path)
docker network create mynet
docker run --network mynet myapp:1.0              # join network
docker network inspect mynet                      # see connected containers

# ── Cleanup ────────────────────────────────────────────────────
docker system prune -af --volumes                 # nuclear option: clean everything
Dockerfile — production multi-stage build (Python)
# ══ Stage 1: Build dependencies ══════════════════════════
FROM python:3.12-slim AS builder

# Install system build deps (won't be in final image)
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        libpq-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build
COPY requirements.txt .
# Install under /install so the packages can be copied into the final image
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt

# ══ Stage 2: Lean production image ═══════════════════════
FROM python:3.12-slim

# Runtime deps only
RUN apt-get update && apt-get install -y --no-install-recommends \
        libpq5 \
    && rm -rf /var/lib/apt/lists/*

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Non-root user: security best practice
RUN useradd -m -u 1000 appuser
WORKDIR /app
COPY --chown=appuser:appuser . .
USER appuser

# Health check used by Docker. Kubernetes ignores HEALTHCHECK and relies on its own probes.
# The slim image does not ship curl, so the check uses the Python standard library.
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)" || exit 1

EXPOSE 8000
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
Docker Compose runs a multi-container app with a single docker compose up. It is well suited to local development: one command starts your app, database, cache, and message queue.
docker compose up -d                     # start all services in background
docker compose up --build                # rebuild images then start
docker compose ps                        # status of all services
docker compose logs -f web               # follow logs for web service
docker compose exec web bash             # shell into running container
docker compose run --rm web pytest       # one-off command (runs+deletes)
docker compose down                      # stop and remove containers
docker compose down -v                   # also delete volumes (clean slate)
docker compose up -d --scale worker=3    # run 3 worker containers
Kubernetes — Deep Reference
Kubernetes (K8s) is a container orchestration platform — it automates deploying, scaling, networking, health checking, and self-healing of containerized applications across a cluster of machines.
The mental model: you describe desired state (“I want 3 replicas of this container, always running”), and Kubernetes continuously makes reality match that description. It restarts crashed pods, reschedules pods from failed nodes, scales based on traffic, and manages rolling updates.
graph TD
subgraph "Kubernetes Cluster"
subgraph "Control Plane"
API["API Server\nAll requests go here"]
ETCD["etcd\nDistributed key-value\nCluster state database"]
SCHED["Scheduler\nAssigns pods to nodes"]
CM["Controller Manager\nReconciliation loops"]
end
subgraph "Worker Node 1"
KL1["kubelet\nManages pods on this node"]
KP1["kube-proxy\nNetwork rules (iptables/eBPF)"]
P1["Pod\n[container + container]"]
P2["Pod\n[container]"]
end
subgraph "Worker Node 2"
KL2["kubelet"]
P3["Pod\n[container]"]
end
API <--> ETCD
API --> SCHED
API --> CM
API --> KL1
API --> KL2
end
deployment.yaml — production Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
spec:
  replicas: 3                      # desired pod count
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # max pods above replicas during update
      maxUnavailable: 0            # zero-downtime: never kill before new is ready
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: ghcr.io/company/myapp:sha-abc123
          ports:
            - containerPort: 8000
          # Resource limits: always set these
          resources:
            requests:              # guaranteed allocation
              cpu: "100m"          # 0.1 CPU core
              memory: "128Mi"
            limits:                # hard ceiling
              cpu: "500m"
              memory: "512Mi"
          # Probes: Kubernetes uses these to manage traffic
          livenessProbe:           # restart if this fails
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:          # only send traffic if this passes
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
          # Environment from ConfigMap + Secret
          envFrom:
            - configMapRef:
                name: myapp-config
            - secretRef:
                name: myapp-secrets
      # Spread replicas evenly across nodes (pod counts per node differ by at most 1)
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: AverageUtilization
          averageUtilization: 70   # scale up if CPU > 70%
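The HorizontalPodAutoscaler in the manifest above scales on the ratio of current to target utilization. A simplified sketch of that calculation follows; it models the documented formula desired = ceil(current × currentMetric / targetMetric) plus the default 10% tolerance, and leaves out stabilization windows, pod readiness handling, and scaling policies. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA calculation for a single utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, desired))
```

With the values from the manifest (target 70%, bounds 3 to 20), three pods averaging 140% CPU scale to six, and ten pods averaging 35% scale down to five.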
kubectl essential commands
# ── Cluster info ────────────────────────────────────────────
kubectl cluster-info
kubectl get nodes -o wide                           # all nodes with IPs
kubectl top nodes                                   # CPU/memory per node

# ── Namespaces ──────────────────────────────────────────────
kubectl get namespaces
kubectl create namespace staging
kubectl config set-context --current --namespace=production   # set default ns

# ── Workloads ───────────────────────────────────────────────
kubectl get pods -n production -o wide              # pods with node placement
kubectl get pods -A                                 # all namespaces
kubectl describe pod myapp-abc123 -n production     # full pod details
kubectl logs -f myapp-abc123 -c myapp               # follow logs, specific container
kubectl logs -l app=myapp --tail=100                # logs from all pods with label
kubectl exec -it myapp-abc123 -- bash               # shell into pod

# ── Deployments ─────────────────────────────────────────────
kubectl apply -f deployment.yaml                    # apply manifest
kubectl rollout status deployment/myapp             # watch rollout progress
kubectl rollout history deployment/myapp            # revision history
kubectl rollout undo deployment/myapp               # rollback to previous revision
kubectl set image deployment/myapp myapp=myapp:new-tag   # update image
kubectl scale deployment/myapp --replicas=5

# ── Debugging ──────────────────────────────────────────────
kubectl get events -n production --sort-by='.lastTimestamp'
kubectl port-forward svc/myapp 8080:80              # local access to service
kubectl run debug --rm -it --image=busybox -- sh    # temporary debug pod
kubectl top pods -n production                      # resource usage
GitOps with ArgoCD
GitOps is the practice of using Git as the single source of truth for your infrastructure and application configuration. Instead of running kubectl apply from a CI pipeline (push-based), a controller running inside the cluster continuously pulls config from Git and reconciles the cluster state.
Why GitOps is better than push-based:
No cluster credentials in your CI server (reduced attack surface)
Every deployment is recorded in Git history with author, timestamp, and message
Drift detection — if someone runs kubectl edit manually, ArgoCD detects and corrects it
Rollback = git revert — no special tooling needed
ArgoCD is the most widely adopted GitOps controller. See ArgoCD for the full reference including Application CRD, sync policies, App-of-Apps pattern, and secret management strategies.
Monitoring, Logging & Observability
Observability vs Monitoring — The Difference
Monitoring is watching known metrics for threshold breaches. “Alert when CPU > 90%.” You know what to look for.
Observability is the ability to ask arbitrary questions about a system’s internal state from its external outputs — without deploying new code. “Why are 0.3% of requests to /checkout failing for users in Singapore?”
The difference matters when you’re debugging a production incident you’ve never seen before. Monitoring tells you something is wrong. Observability helps you understand why.
The Three Pillars are complementary — you need all three to fully understand a system:
graph TD
M["📊 METRICS\nNumerical measurements over time\nCPU · Memory · Request rate\nError rate · Latency percentiles\nTools: Prometheus + Grafana"]
L["📋 LOGS\nTimestamped event records\nApplication errors · Access logs\nAudit trails · Debug output\nTools: Loki · ELK · Fluentd"]
T["🔍 TRACES\nRequest journey across services\nWhere time was spent\nWhich service caused slowness\nTools: Jaeger · Tempo · Zipkin"]
M -->|"Know something is wrong"| Incident
L -->|"Find what happened"| Incident
T -->|"Find where it happened"| Incident
Incident["🚨 Incident Resolution"]
Prometheus + Grafana — Deep Reference
Prometheus is a pull-based time-series metrics database. It scrapes /metrics endpoints on a schedule and stores numeric measurements with labels. The query language is PromQL.
Grafana connects to Prometheus (and dozens of other data sources) to build dashboards, set alert rules, and visualize system health.
prometheus.yml — full scrape configuration
global:
  scrape_interval: 15s        # how often to scrape each target
  evaluation_interval: 15s    # how often to evaluate alert rules
  external_labels:
    cluster: production
    region: us-east-1

# Alertmanager: where to send firing alerts
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

# Load alert rules from files
rule_files:
  - "alerts/application.yml"
  - "alerts/infrastructure.yml"

scrape_configs:
  # ── Kubernetes pods ──────────────────────────────────────
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use custom port if annotation set (keep the pod IP, swap only the port)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # ── Node exporter (OS metrics per machine) ───────────────
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]

  # ── Kubernetes API server ────────────────────────────────
  - job_name: kubernetes-apiservers
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
PromQL cheat sheet:
─────────────────────────────────────────────────────────────
# Rate of requests per second over 5-minute window
rate(http_requests_total[5m])
# Filter by label
rate(http_requests_total{status="200", service="api"}[5m])
# Sum across all instances, group by service
sum(rate(http_requests_total[5m])) by (service)
# p95 latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Memory used as % of limit
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
# Increase over last hour (for counters)
increase(errors_total[1h])
# 5-minute moving average of CPU
avg_over_time(cpu_usage_percent[5m])
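To make rate() less of a black box, here is a rough sketch of the idea behind it: total counter increase over a window divided by elapsed time, with counter resets handled. It is a simplification; real Prometheus also extrapolates to the window boundaries, so its results differ slightly. The function name and sample data are made up.

```python
def simple_rate(samples):
    """Approximate per-second rate from (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter went down: the process restarted and the counter reset to 0,
            # so everything counted since the reset is new increase.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

For samples `[(0, 100), (15, 130), (30, 160)]` the counter grew by 60 over 30 seconds, so the rate is 2.0 requests per second.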
SLIs, SLOs, SLAs, and Error Budgets
This framework, pioneered by Google SRE, makes reliability a data-driven engineering conversation instead of a political argument between Dev (“ship faster”) and Ops (“don’t break things”).
SLI (Service Level Indicator) — a carefully defined measurement of a service’s behavior. Not just “is it up?” but precise: “What percentage of requests complete successfully in under 500ms?”
SLO (Service Level Objective) — your target for the SLI. “99.9% of requests succeed in under 500ms over a rolling 28-day window.”
Error Budget — the allowed failure budget. If your SLO is 99.9%, you have 0.1% = ~43 minutes/month you’re allowed to be unreliable. This is budget you spend, not a punishment.
SLA (Service Level Agreement) — the contractual commitment to customers. Usually more lenient than your SLO (e.g., SLO = 99.9%, SLA = 99.5%). If you breach SLA, you pay credits.
Error Budget Math:
─────────────────────────────────────────────────────────────
SLO: 99.9% availability over 30 days
Error Budget = 100% - 99.9% = 0.1%
30 days = 30 × 24 × 60 = 43,200 minutes
Allowed downtime = 43,200 × 0.001 = 43.2 minutes/month
If your last deploy caused 1 hour of degradation:
→ You've consumed 139% of your monthly error budget
→ Feature work pauses, team focuses on reliability
If your budget is untouched after 3 weeks:
→ You're being too conservative
→ Ship faster, take more risk, the budget exists for a reason
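The error budget arithmetic above is easy to script, which helps when comparing SLO targets. A minimal sketch, with hypothetical function names and a 30-day window assumed by default:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unreliability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_consumed_percent(downtime_minutes: float, slo_percent: float,
                            window_days: int = 30) -> float:
    """Share of the error budget used by a given amount of downtime."""
    return downtime_minutes / error_budget_minutes(slo_percent, window_days) * 100
```

A 99.9% SLO over 30 days allows about 43.2 minutes, so a 60-minute incident consumes roughly 139% of the budget. Moving the SLO to 99.99% shrinks the budget to about 4.3 minutes, which is why each extra nine is expensive.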
Centralized Logging — ELK Stack
Raw logs from hundreds of containers are useless unless you can search, filter, and correlate them. The ELK Stack (Elasticsearch + Logstash + Kibana) or the modern PLG Stack (Promtail + Loki + Grafana) centralizes all logs.
In a microservices architecture, a single user request may touch 10+ services. When it’s slow, which service caused it? Distributed tracing answers this by propagating a trace ID through every service call.
OpenTelemetry (OTel) is the vendor-neutral standard for traces, metrics, and logs. You instrument your code once and send to Jaeger, Tempo, Datadog, or any backend.
OpenTelemetry — instrument a Python FastAPI service
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

# Configure tracer: do this at startup
resource = Resource(attributes={SERVICE_NAME: "payment-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument FastAPI, SQLAlchemy, and outbound HTTP
# (app and engine are the FastAPI application and SQLAlchemy engine, created elsewhere)
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
RequestsInstrumentor().instrument()

# Manual span for custom operations
# (charge_card and PaymentError come from the application's own payment code)
tracer = trace.get_tracer(__name__)


async def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        try:
            result = await charge_card(amount)
            span.set_attribute("payment.status", "success")
            return result
        except PaymentError as e:
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
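Trace ID propagation itself is simple enough to sketch by hand. The example below assumes the W3C Trace Context `traceparent` header format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags); in practice the OpenTelemetry instrumentation handles this automatically, and the function names here are hypothetical.

```python
import re
import secrets

# traceparent: version "00", trace id, parent span id, trace flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming: str) -> str:
    """Build the header for an outbound call: same trace ID, new span ID."""
    match = TRACEPARENT.match(incoming)
    if not match:
        return new_traceparent()  # missing or malformed header: start a fresh trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Every service keeps the trace ID and replaces only the span ID, which is what lets the backend stitch spans from 10+ services into one request timeline.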
DevSecOps — Security in the Pipeline
What is DevSecOps
DevSecOps means security is a shared responsibility baked into every pipeline stage — not a separate team that reviews code once a quarter. The mantra is “shift left”: catch vulnerabilities as early as possible, when they’re cheapest to fix.
A vulnerability found in code review costs $80 to fix. The same vulnerability found in production costs $7,600 (NIST study). The math makes the case.
Runtime: watch for unexpected process execution, network calls, and file writes.
Secret Management — Never Commit Secrets
The #1 DevSecOps rule: never commit secrets to Git. API keys, database passwords, certificates, tokens — none of it belongs in source code. Even in a private repo. Even “temporarily.”
A secret committed to Git is permanent — even after deletion, it exists in git history, forks, CI logs, and developer laptops.
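A pre-commit check is one way to stop secrets before they reach Git history. The sketch below scans text for two well-known patterns (AWS access key IDs and PEM private key headers); it is an illustration only, far less thorough than dedicated scanners, and the names `PATTERNS` and `find_secrets` are hypothetical.

```python
import re

# Two widely recognized secret shapes; real scanners ship hundreds of rules.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings
```

Running this over staged files and failing the commit on any finding catches the most common accidents; a leaked key still has to be rotated, because the check cannot remove what is already in history.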
Trivy — universal security scanner
# Scan a container image for OS + app vulnerabilities
trivy image python:3.12-slim
trivy image --severity HIGH,CRITICAL myapp:latest

# Scan source code / filesystem
trivy fs .
trivy fs --scanners vuln,secret,misconfig .

# Scan Kubernetes cluster
trivy k8s --report summary cluster

# Scan Terraform / Helm IaC
trivy config ./terraform/
trivy config ./helm-charts/

# CI output: fail pipeline on critical vulnerabilities
trivy image --exit-code 1 --severity CRITICAL myapp:latest

# Generate SBOM (Software Bill of Materials)
trivy image --format cyclonedx --output sbom.json myapp:latest
SRE is Google’s answer to the question: “How do you scale operations when you can’t linearly scale the ops team?” Ben Treynor Sloss invented it around 2003 — hire software engineers and give them operational responsibility. The result: they automate everything instead of doing it manually.
The key difference between SRE and traditional Ops: SRE teams have a toil budget (maximum 50% of time on manual work) and an error budget (allowed unreliability). Both are enforced.
The Google SRE books are free online — they’re among the most influential engineering documents ever written.
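To make the error budget concrete, here is a minimal sketch of the arithmetic. It assumes the budget is derived from an availability target (an SLO) over a rolling window, which is the usual framing but is not spelled out above. The function names and the 30-day default are assumptions for this example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 21.6 minutes of downtime, about half of the budget is left.
    print(round(budget_remaining(0.999, 21.6), 2))
```

Enforcing the budget then becomes a policy decision: once the remaining fraction reaches zero, feature releases pause and reliability work takes priority.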
Incident Management
When production breaks, a structured response process reduces MTTR (Mean Time to Restore) and prevents chaos.
graph TD
DETECT["🚨 DETECT\nAlert fires\nCustomer report\nSynthetic monitor fails"]
TRIAGE["🔍 TRIAGE\nHow bad is it?\nWhat's affected?\nWho owns it?"]
DECLARE["📢 DECLARE\nCreate incident channel\nAssign Incident Commander\nNotify stakeholders"]
MITIGATE["🛠️ MITIGATE\nRestore service FIRST\nRollback / Feature flag off\nScale up / Reroute traffic"]
INVESTIGATE["🔬 INVESTIGATE\nFind root cause\nWhile service is stable"]
RESOLVE["✅ RESOLVE\nClose incident\nPost-mortem scheduled"]
POSTMORTEM["📝 POST-MORTEM\nBlameless analysis\nTimeline\nAction items"]
DETECT --> TRIAGE --> DECLARE --> MITIGATE --> INVESTIGATE --> RESOLVE --> POSTMORTEM
The golden rule: mitigate first, investigate second. A rolled-back deploy that ends an incident in 10 minutes beats 2 hours of debugging with users down.
Blameless post-mortems are essential. When engineers fear blame, they hide information, don’t take risks, and incidents recur. Blameless means: the system failed, not the person. Find the systemic issue.
Toil — What It Is and Why to Eliminate It
Toil is manual, repetitive, automatable operational work that scales linearly with service growth. Examples include restarting crashed pods manually, provisioning accounts by hand, and running SQL migrations via SSH.
Toil is not “bad work” — it’s work that should eventually not exist. SRE caps toil at 50% of team time. The rest goes to engineering work that permanently reduces future toil.
Toil Test — Ask these questions:
─────────────────────────────────────────────────────────
Manual? → Is it triggered by a human every time?
Repetitive? → Do you do it more than once?
Automatable? → Could a machine do this exactly?
No lasting value? → Does it leave the system better? Or just running?
Scales with load? → Does it grow as your service grows?
If yes to most → it's toil → automate it, eliminate it
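The checklist above can be sketched as a small scoring function. This is only an illustration: the `Task` fields mirror the five questions, and reading "most" as at least three yes answers out of five is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    manual: bool            # triggered by a human every time
    repetitive: bool        # done more than once
    automatable: bool       # a machine could do it exactly
    no_lasting_value: bool  # keeps the system running without improving it
    scales_with_load: bool  # grows as the service grows


def is_toil(task: Task, threshold: int = 3) -> bool:
    """True when at least `threshold` of the five checklist answers are yes."""
    answers = (
        task.manual,
        task.repetitive,
        task.automatable,
        task.no_lasting_value,
        task.scales_with_load,
    )
    return sum(answers) >= threshold


if __name__ == "__main__":
    restart = Task("restart crashed pods", True, True, True, True, True)
    review = Task("architecture review", True, False, False, False, False)
    for task in (restart, review):
        print(f"{task.name}: {'toil' if is_toil(task) else 'not toil'}")
```

Running a team's recurring tasks through a check like this gives a rough backlog of automation candidates.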
DORA Metrics — Measuring DevOps Performance
DORA (DevOps Research and Assessment) identified four metrics that distinguish elite engineering teams from low performers. Nicole Forsgren’s research (published in Accelerate, 2018) proved these metrics predict organizational performance.
| Metric | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- |
| Deployment Frequency | Multiple/day | Weekly | Monthly | Every 6 months |
| Lead Time for Changes | < 1 hour | 1 day – 1 week | 1 week – 1 month | 1 – 6 months |
| MTTR (restore time) | < 1 hour | < 1 day | 1 – 7 days | 1 – 6 months |
| Change Failure Rate | 0 – 15% | 0 – 15% | 16 – 30% | 46 – 60% |
These metrics are leading indicators — improving them predicts better business outcomes (profitability, market share, customer satisfaction).
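The four metrics can be computed from plain deployment records. The sketch below is one possible way to do it, under stated assumptions: the `Deployment` record shape is hypothetical, lead time is measured from commit to deploy, and restore time is measured from the failed deploy to recovery. Your own data sources (CI system, incident tracker) will likely need different field names.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def dora_summary(deployments: list[Deployment], window_days: int) -> dict[str, Optional[float]]:
    """Compute the four DORA metrics for deployments observed in a time window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_seconds = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_seconds) / 3600,
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment has been restored in the window.
        "mean_time_to_restore_hours": (
            sum(restore_seconds) / len(restore_seconds) / 3600 if restore_seconds else None
        ),
    }
```

Comparing the output against the table above shows which tier a team currently sits in for each metric, and which metric is holding it back.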
Platform Engineering
What is Platform Engineering
Platform Engineering is the next evolution after DevOps. Instead of every team building their own CI/CD pipelines, Terraform modules, and observability setup, a dedicated Platform Engineering team builds an Internal Developer Platform (IDP) — a self-service layer that gives developers everything they need to deploy, monitor, and operate their services without deep infrastructure knowledge.
The analogy: DevOps is teaching every developer to cook. Platform Engineering is building a restaurant kitchen where developers just place orders.
Golden Paths
A Golden Path is the recommended, supported, opinionated way to do something — deploy an app, set up monitoring, manage secrets. It’s called “golden” because it’s paved and maintained, not because it’s the only way.
Without Golden Path (every team reinvents):
Team A: Jenkins + custom Bash + manual Terraform
Team B: GitHub Actions + Helm + manual kubectl
Team C: CircleCI + Docker Compose + SSH deployment
Result: 3 ways to deploy, 3 incident runbooks, 3 security reviews
With Golden Path:
Every team: GitHub Actions template → standard Helm chart → ArgoCD
Result: 1 deployment system, 1 runbook, 1 security review → 10× faster onboarding
DORA Metrics for Platform Teams
Platform teams measure success differently, because their “users” are internal developers.
Kestra is an open-source, declarative workflow orchestration platform for data, infra, and AI pipelines. It is YAML-first and GitOps-native, with 1400+ plugins. It covers the orchestration gap between simple CI/CD (run on code push) and complex multi-step workflows (ETL, infra automation, AI pipelines).
Use it for: running Ansible playbooks on a schedule, triggering Terraform plans from events, data ingestion pipelines, and AI model retraining workflows.
→ Full reference including triggers, parallel tasks, error handling, retries: kestra
Project Management Tools
Jira — issue tracking, sprint planning, and roadmapping used widely in DevOps teams. Deep integrations with GitHub Actions, Jenkins, and GitLab CI for build status, deployment tracking, and incident tickets.
Trello — simple Kanban boards, great for small teams or personal task tracking.
Asana — structured project management with timelines, dependencies, and workload views.
DevOps does not exist in isolation — it is the connective tissue between almost every other engineering discipline. Here is where to go next based on what you want to deepen.
The OS layer underneath everything — every container, every cron job, and every CI runner sits on Linux. Linux Advanced covers kernel internals, namespaces, cgroups (the building blocks of containers), performance tuning with perf, and security hardening with SELinux and AppArmor. If you want to understand why kubectl exec works the way it does, start there.
The glue that holds pipelines together — most CI steps are Shell Script under the hood. Bash scripting, text processing with awk/sed, job control, and process management are skills that pay off in every pipeline you write.
Configuring what Terraform creates — once Infrastructure as Code (IaC) provisions your servers, Ansible configures them: installing packages, deploying apps, managing systemd services, and rotating secrets with Ansible Vault. SaltStack takes a more reactive approach — event-driven config that responds to infrastructure changes automatically.
GitOps in depth — ArgoCD is how mature teams implement continuous delivery for Kubernetes. Git becomes the source of truth; every merge to main automatically reconciles the cluster. Covers Application CRDs, the App-of-Apps pattern, and handling secrets safely in GitOps workflows.
Your CI/CD toolbox — GitHub Actions is the modern default for GitHub-hosted projects (reusable workflows, matrix builds, OIDC auth). GitLab CI is its equal for GitLab shops with stronger built-in environments. Jenkins is the veteran — more complex to operate but unmatched in plugin depth and flexibility. CircleCI, Travis CI, Bamboo, and TeamCity each serve specific team needs covered in their own pages.
Security in every pipeline stage — Cybersecurity covers the threat models and defense principles that motivate every DevSecOps check. Cybersecurity Architecture goes deeper into Zero Trust network design, IAM, and SIEM/SOAR — the enterprise security layer that DevOps pipelines must integrate with.
Kernel-level observability — eBPF is how Falco, Cilium, and Tetragon hook into the kernel to detect runtime threats and network anomalies without any application changes. It is increasingly the standard for production security and observability in Kubernetes environments.
The CI/CD concepts themselves — Continuous Integration and Continuous Delivery each have dedicated pages explaining the principles, practices, and tradeoffs in depth — not just the tools.
When your pipelines need to orchestrate complex multi-step workflows — Automation covers scripting patterns, task scheduling, and kestra which handles ETL pipelines, infrastructure automation jobs, and AI workflows that outgrow simple CI/CD.
The architecture DevOps serves — System Design is the bigger picture: how to design the systems these pipelines deploy. System Design - Microservices shows the architecture that makes independent deployment so critical. System Design - Scalability & CAP explains why horizontal scaling (which IaC automates) exists.
Keeping your team in sync — Jira tracks sprint work and incident tickets, integrates with every CI tool listed above, and is where action items from post-mortems land. Trello and Asana serve lighter planning needs. Azure DevOps is Microsoft’s all-in-one: Boards, Repos, Pipelines, Artifacts, and Test Plans under one roof.