About This Page

Covers microservices architecture and distributed system patterns — service communication, resilience, and data consistency. Parent: System Design. See also: System Design - APIs & Networking, System Design - Databases.

Don't Start with Microservices

Monolith first, then extract services as pain points emerge. Premature microservices = distributed monolith with all the complexity and none of the benefits.

Monolith vs Microservices

Decision Framework

flowchart TD
    Start[New Project?] --> Small{Team < 10\ndevs?}
    Small -->|Yes| Monolith[Start with Monolith ✅]
    Small -->|No| Scale{Scaling\npain points?}
    Scale -->|No| Monolith
    Scale -->|Yes| Extract[Extract Services\nStrangler Fig Pattern]
    Extract --> MS[Microservices 🏗]

Trade-offs Table

| Aspect | Monolith | Microservices |
|---|---|---|
| Initial complexity | Low | High |
| Deploy | One artifact | N independent deployments |
| Scale | Entire app together | Individual services |
| Fault isolation | None — one bug crashes all | Failures contained per service |
| Tech stack | Uniform | Polyglot (each service chooses) |
| Testing | Simple | Complex (integration, contract) |
| Team model | Feature teams | Service ownership teams |
| Data | Shared DB (simple) | DB per service (complex) |

Service Communication

Sync vs Async

| Mode | How | When | Examples |
|---|---|---|---|
| Synchronous | Caller waits for response | User-facing, immediate result needed | REST, gRPC |
| Asynchronous | Fire and forget via queue | Background jobs, no immediate result needed | Kafka, RabbitMQ, SQS |

Message Queue Patterns

Point-to-Point (Queue):
  Producer → [Queue] → Single Consumer
  Message deleted after consumed
  Use: Task queues, job processing

Publish-Subscribe (Topic):
  Producer → [Topic] → Consumer A
                    → Consumer B
                    → Consumer C
  All subscribers receive every message
  Use: Event notifications, fan-out

Example — Order Service:
  Order created → publishes "order.created"
  ├── Email Service       → sends confirmation
  ├── Inventory Service   → reserves stock
  └── Analytics Service   → logs event
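
A minimal in-process sketch of the pub-sub fan-out above. The handlers are placeholders; a real system would publish to a broker (Kafka, RabbitMQ) rather than a local dict:

from collections import defaultdict

subscribers = defaultdict(list)  # topic -> list of handler callables

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    for handler in subscribers[topic]:  # every subscriber gets every message
        handler(event)

subscribe("order.created", lambda e: print("Email: send confirmation for", e["order_id"]))
subscribe("order.created", lambda e: print("Inventory: reserve stock for", e["order_id"]))
subscribe("order.created", lambda e: print("Analytics: log event", e))

publish("order.created", {"order_id": 42})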

Message Queue Comparison

| Feature | Kafka | RabbitMQ | AWS SQS |
|---|---|---|---|
| Type | Distributed log | AMQP broker | Managed cloud queue |
| Retention | Days/weeks (replay!) | Deleted on consume | Up to 14 days |
| Throughput | Millions/sec | Tens of thousands/sec | Nearly unlimited |
| Replay | ✅ Any offset | ❌ | ❌ |
| Best for | Event streaming, CQRS, audit | Task queues, RPC | AWS serverless |

API Gateway

Responsibilities

flowchart LR
    Client --> GW[API Gateway]
    GW -->|Auth ✓| Auth[Auth Service]
    GW -->|Route| US[User Service]
    GW -->|Route| OS[Order Service]
    GW -->|Route| PS[Product Service]
    GW -->|Rate limit| RL[Rate Limiter]

API Gateway handles:
  ✅ Authentication & Authorization
  ✅ Rate limiting & throttling
  ✅ Request routing & load balancing
  ✅ SSL termination
  ✅ Request/response transformation
  ✅ Logging, metrics, distributed tracing
  ✅ Caching (sometimes)
  ✅ Circuit breaking

Popular: Kong, AWS API Gateway, Nginx, Traefik, Envoy
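
A toy sketch of two gateway responsibilities, prefix routing and rate limiting. Routes, limits, and upstream names are illustrative assumptions, not any real gateway's config:

import time
from collections import defaultdict

ROUTES = {"/users": "http://user-service", "/orders": "http://order-service"}  # assumed upstreams
request_log = defaultdict(list)  # client_id -> recent request timestamps

def allow(client_id, limit=100, window=60):
    # Naive sliding-window limiter: at most `limit` requests per `window` seconds.
    now = time.time()
    recent = [t for t in request_log[client_id] if now - t < window]
    if len(recent) >= limit:
        request_log[client_id] = recent
        return False  # caller should answer 429 Too Many Requests
    recent.append(now)
    request_log[client_id] = recent
    return True

def route(path):
    # Longest-prefix matching would be better; first-match keeps the sketch short.
    for prefix, upstream in ROUTES.items():
        if path.startswith(prefix):
            return upstream + path
    return None  # caller should answer 404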

Service Discovery

Client-Side vs Server-Side

Client-Side Discovery:
  Client → Service Registry → Gets list of instances
  Client picks one (round-robin/random) → Calls directly
  Pro: No extra hop. Con: Discovery logic in every client.
  Tool: Netflix Eureka

Server-Side Discovery:
  Client → Load Balancer → LB queries registry → Routes request
  Pro: Clients stay simple. Con: Extra hop via LB.
  Tool: AWS ALB + ECS service discovery, Kubernetes Service

Service Registry:
  Consul, Eureka, etcd, Zookeeper
  Services register on startup, deregister on shutdown
  Health checks remove unhealthy instances automatically
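
A client-side discovery sketch: an in-memory registry plus round-robin selection. Real clients would query Consul or Eureka over HTTP and refresh as health checks change; the instance addresses here are made up:

import itertools

registry = {
    "user-service": ["10.0.0.1:8080", "10.0.0.2:8080"],  # healthy instances only
}
cursors = {name: itertools.cycle(hosts) for name, hosts in registry.items()}

def discover(service_name):
    return next(cursors[service_name])  # round-robin across instances

print(discover("user-service"))  # 10.0.0.1:8080
print(discover("user-service"))  # 10.0.0.2:8080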

Resilience Patterns

Circuit Breaker

stateDiagram-v2
    [*] --> CLOSED : Normal
    CLOSED --> OPEN : N failures in window
    OPEN --> HALF_OPEN : Timeout expires
    HALF_OPEN --> CLOSED : Test request ✅
    HALF_OPEN --> OPEN : Test request ❌

CLOSED:     Requests pass through normally
OPEN:       Requests fail fast — no call to the failing service
HALF_OPEN:  Allow one test request to check recovery

Benefits:
  ✅ Prevents cascading failures
  ✅ Gives failing service time to recover
  ✅ Fast failure (no waiting for timeout)

Libraries: Resilience4j (Java), Polly (.NET), Hystrix (deprecated)
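
A minimal circuit-breaker sketch mirroring the three states above. The threshold and timeout values are illustrative; the libraries listed handle sliding windows, metrics, and concurrency properly:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"  # timeout expired: allow one test request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"  # trip (or re-trip) the breaker
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "CLOSED"  # success closes the circuit
        return result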

Retry with Exponential Backoff

import random
import time

class TransientError(Exception):
    """Retryable failure (e.g. timeout, 503). Placeholder for this sketch."""

def call_with_retry(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the last failure
            wait = (2 ** attempt) + random.uniform(0, 1)  # exponential backoff + jitter
            time.sleep(wait)
    # Waits between attempts: 1s, 2s, 4s, 8s (+ jitter); the 5th failure raises

Bulkhead Pattern

Isolate thread pools per downstream service.
If Service A is slow and exhausts its pool, Service B/C are unaffected.

Thread Pool Bulkhead:
  Service A calls: pool of 20 threads
  Service B calls: pool of 20 threads
  Service C calls: pool of 20 threads

Semaphore Bulkhead:
  Limit concurrent calls to N (simpler, same thread)

Key insight: Don't let one slow dependency exhaust shared resources.
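
A semaphore-bulkhead sketch of that insight: cap concurrent calls per downstream so a slow dependency rejects excess load instead of hogging shared capacity (pool sizes and service names are illustrative):

import threading

bulkheads = {
    "service-a": threading.Semaphore(20),  # max 20 in-flight calls each
    "service-b": threading.Semaphore(20),
}

def call_with_bulkhead(service, fn):
    sem = bulkheads[service]
    if not sem.acquire(blocking=False):
        raise RuntimeError(f"{service} bulkhead full: rejecting call")
    try:
        return fn()
    finally:
        sem.release()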

Distributed Data Patterns

Saga Pattern (Distributed Transactions)

Problem: Distributed transactions across services (2PC blocks participants and scales poorly)

Saga = sequence of local transactions with compensating transactions on failure

Choreography-based Saga:
  Order Service → creates order → publishes "order.created"
  Payment Service → charges card → publishes "payment.done"
  Inventory Service → reserves stock → publishes "stock.reserved"
  Shipping Service → schedules delivery
  On failure → each service publishes compensation event
  (e.g., "payment.refund" if inventory fails)

Orchestration-based Saga:
  Central Saga Orchestrator tells each service what to do
  Orchestrator tracks state and handles compensation
  Pro: Easier to follow and debug. Con: Orchestrator adds central coupling and is a potential single point of failure.
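
A minimal orchestration-saga sketch: run each local transaction in order and, on failure, run the compensations for the completed steps in reverse. The step actions are print placeholders:

def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):  # compensate in reverse order
                comp()
            raise

run_saga([
    (lambda: print("create order"),  lambda: print("cancel order")),
    (lambda: print("charge card"),   lambda: print("refund payment")),
    (lambda: print("reserve stock"), lambda: print("release stock")),
])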

CQRS (Command Query Responsibility Segregation)

Separate read model from write model.

Write side (Command):
  Optimised for consistency
  Normalised data, ACID transactions
  Slower queries OK

Read side (Query):
  Optimised for performance
  Denormalised, pre-aggregated views
  Multiple read models for different use cases

Sync via: Event sourcing, CDC (Change Data Capture), Kafka

Example:
  Write: UPDATE orders table (normalised)
  Read:  Pre-computed "user's order history" view in Redis / Elasticsearch
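
A minimal CQRS sketch: a normalised write model plus a denormalised read model kept in sync by a projection function (an in-process stand-in for Kafka or CDC; all names are illustrative):

orders = {}         # write model: order_id -> normalised order row
order_history = {}  # read model: user_id -> pre-built list of summaries

def place_order(order_id, user_id, total):
    orders[order_id] = {"user_id": user_id, "total": total}  # command side
    project({"order_id": order_id, "user_id": user_id, "total": total})

def project(event):
    # Query side: update the denormalised view so reads need no joins.
    order_history.setdefault(event["user_id"], []).append(
        {"order_id": event["order_id"], "total": event["total"]})

place_order(1, "alice", 99)
print(order_history["alice"])  # fast read from the pre-computed view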

Outbox Pattern

Problem:
  1. Write to DB ✅
  2. Publish to Kafka ❌ (crash) → message lost!

Outbox Solution:
  1. Write to DB + write to outbox_events table (same transaction ✅)
  2. Background worker reads outbox → publishes to Kafka
  3. Mark outbox record as published

Guarantees: At-least-once delivery
Requirement: Make consumers idempotent (safe to process twice)

Tools: Debezium (CDC) can watch the outbox table automatically
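
A sketch of the write path and worker using sqlite3 (table names and columns are assumptions for illustration). The key point is that both inserts share one transaction:

import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox_events (id INTEGER PRIMARY KEY, "
           "payload TEXT, published INTEGER DEFAULT 0)")

with db:  # one transaction: both rows commit, or neither does
    db.execute("INSERT INTO orders (total) VALUES (?)", (99.0,))
    db.execute("INSERT INTO outbox_events (payload) VALUES (?)",
               (json.dumps({"type": "order.created", "total": 99.0}),))

# Background worker: read unpublished events, publish, mark as published.
rows = db.execute(
    "SELECT id, payload FROM outbox_events WHERE published = 0").fetchall()
for event_id, payload in rows:
    print("publish to broker:", payload)  # stand-in for a Kafka producer send
    db.execute("UPDATE outbox_events SET published = 1 WHERE id = ?",
               (event_id,))
db.commit()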

Event Sourcing

Instead of storing current state → store sequence of events

Traditional: users table { id, name, balance = 500 }

Event Sourcing:
  Event 1: AccountOpened { id: 1, initial: 1000 }
  Event 2: MoneyWithdrawn { id: 1, amount: 200 }
  Event 3: MoneyWithdrawn { id: 1, amount: 300 }
  Current balance: 1000 - 200 - 300 = 500

Benefits:
  ✅ Full audit trail built-in
  ✅ Replay events to rebuild any past state
  ✅ Natural fit for CQRS

Drawbacks:
  ❌ Querying current state requires replaying events (use snapshots)
  ❌ Schema evolution of events is hard
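
A sketch of replaying the event log above to rebuild current state (a fold over the events; snapshots would periodically cache this result):

events = [
    {"type": "AccountOpened",  "initial": 1000},
    {"type": "MoneyWithdrawn", "amount": 200},
    {"type": "MoneyWithdrawn", "amount": 300},
]

def replay(events):
    balance = 0
    for e in events:  # apply each event in order
        if e["type"] == "AccountOpened":
            balance = e["initial"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
        elif e["type"] == "MoneyDeposited":
            balance += e["amount"]
    return balance

print(replay(events))  # 500 -- current state, with full history preserved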

Distributed Tracing

How It Works

Every request gets a unique trace_id
Each service hop creates a span with span_id + parent span_id

Request → API Gateway (trace: abc, span: 1)
            → User Service   (trace: abc, span: 2, parent: 1)
                → DB query   (trace: abc, span: 3, parent: 2)
            → Order Service  (trace: abc, span: 4, parent: 1)

Trace aggregator collects all spans → builds flame chart
→ Instantly see which service is the bottleneck

Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM, OpenTelemetry
Standard: OpenTelemetry (vendor-neutral)
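
A nested-span sketch using the OpenTelemetry Python SDK (pip packages opentelemetry-api and opentelemetry-sdk; span names are illustrative). The child span automatically inherits the trace_id and records its parent span_id:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-demo")

with tracer.start_as_current_span("api-gateway"):        # root span
    with tracer.start_as_current_span("user-service"):   # child: same trace, parent = root
        pass  # spans are printed to the console on exit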

Useful Links & Resources