About This Page

Covers microservices architecture and distributed system patterns — service communication, resilience, and data consistency. Parent: System Design. See also: System Design - APIs & Networking, System Design - Databases.

Don't Start with Microservices

Monolith first, then extract services as pain points emerge. Premature microservices = distributed monolith with all the complexity and none of the benefits.

Monolith vs Microservices

Decision Framework

flowchart TD
    Start[New Project?] --> Small{Team < 10\ndevs?}
    Small -->|Yes| Monolith[Start with Monolith ✅]
    Small -->|No| Scale{Scaling\npain points?}
    Scale -->|No| Monolith
    Scale -->|Yes| Extract[Extract Services\nStrangler Fig Pattern]
    Extract --> MS[Microservices 🏗]

Trade-offs Table

| Aspect | Monolith | Microservices |
|---|---|---|
| Initial complexity | Low | High |
| Deploy | One artifact | N independent deployments |
| Scale | Entire app together | Individual services |
| Fault isolation | None — one bug crashes all | Failures contained per service |
| Tech stack | Uniform | Polyglot (each service chooses) |
| Testing | Simple | Complex (integration, contract) |
| Team model | Feature teams | Service ownership teams |
| Data | Shared DB (simple) | DB per service (complex) |

Service Communication

Sync vs Async

| Mode | How | When | Examples |
|---|---|---|---|
| Synchronous | Caller waits for response | User-facing, immediate result needed | REST, gRPC |
| Asynchronous | Fire and forget via queue | Background jobs, no immediate result needed | Kafka, RabbitMQ, SQS |

Message Queue Patterns

Point-to-Point (Queue):
  Producer → [Queue] → Single Consumer
  Message deleted after consumed
  Use: Task queues, job processing

Publish-Subscribe (Topic):
  Producer → [Topic] → Consumer A
                    → Consumer B
                    → Consumer C
  All subscribers receive every message
  Use: Event notifications, fan-out

Example — Order Service:
  Order created → publishes "order.created"
  ├── Email Service       → sends confirmation
  ├── Inventory Service   → reserves stock
  └── Analytics Service   → logs event
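
A minimal in-process sketch of the pub-sub fan-out above. The handlers are placeholders; a real system would publish to a broker (Kafka, RabbitMQ) rather than a local dict:

from collections import defaultdict

subscribers = defaultdict(list)  # topic -> list of handler callables

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    for handler in subscribers[topic]:  # every subscriber gets every message
        handler(event)

subscribe("order.created", lambda e: print("Email: send confirmation for", e["order_id"]))
subscribe("order.created", lambda e: print("Inventory: reserve stock for", e["order_id"]))
subscribe("order.created", lambda e: print("Analytics: log event", e))

publish("order.created", {"order_id": 42})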

Message Queue Comparison

| Feature | Kafka | RabbitMQ | AWS SQS |
|---|---|---|---|
| Type | Distributed log | AMQP broker | Managed cloud queue |
| Retention | Days/weeks (replay!) | Deleted on consume | Up to 14 days |
| Throughput | Millions/sec | Tens of thousands/sec | Nearly unlimited |
| Replay | ✅ Any offset | ❌ | ❌ |
| Best for | Event streaming, CQRS, audit | Task queues, RPC | AWS serverless |

API Gateway

Responsibilities

flowchart LR
    Client --> GW[API Gateway]
    GW -->|Auth ✓| Auth[Auth Service]
    GW -->|Route| US[User Service]
    GW -->|Route| OS[Order Service]
    GW -->|Route| PS[Product Service]
    GW -->|Rate limit| RL[Rate Limiter]

API Gateway handles:
  ✅ Authentication & Authorization
  ✅ Rate limiting & throttling
  ✅ Request routing & load balancing
  ✅ SSL termination
  ✅ Request/response transformation
  ✅ Logging, metrics, distributed tracing
  ✅ Caching (sometimes)
  ✅ Circuit breaking

Popular: Kong, AWS API Gateway, Nginx, Traefik, Envoy
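
A toy sketch of two gateway responsibilities, prefix routing and rate limiting. Routes, limits, and upstream names are illustrative assumptions, not any real gateway's config:

import time
from collections import defaultdict

ROUTES = {"/users": "http://user-service", "/orders": "http://order-service"}  # assumed upstreams
request_log = defaultdict(list)  # client_id -> recent request timestamps

def allow(client_id, limit=100, window=60):
    # Naive sliding-window limiter: at most `limit` requests per `window` seconds.
    now = time.time()
    recent = [t for t in request_log[client_id] if now - t < window]
    if len(recent) >= limit:
        request_log[client_id] = recent
        return False  # caller should answer 429 Too Many Requests
    recent.append(now)
    request_log[client_id] = recent
    return True

def route(path):
    # Longest-prefix matching would be better; first-match keeps the sketch short.
    for prefix, upstream in ROUTES.items():
        if path.startswith(prefix):
            return upstream + path
    return None  # caller should answer 404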

Service Discovery

Client-Side vs Server-Side

Client-Side Discovery:
  Client → Service Registry → Gets list of instances
  Client picks one (round-robin/random) → Calls directly
  Pro: No extra hop. Con: Discovery logic in every client.
  Tool: Netflix Eureka

Server-Side Discovery:
  Client → Load Balancer → LB queries registry → Routes request
  Pro: Clients stay simple. Con: Extra hop via LB.
  Tool: AWS ALB + ECS service discovery, Kubernetes Service

Service Registry:
  Consul, Eureka, etcd, Zookeeper
  Services register on startup, deregister on shutdown
  Health checks remove unhealthy instances automatically
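
A client-side discovery sketch: an in-memory registry plus round-robin selection. Real clients would query Consul or Eureka over HTTP and refresh as health checks change; the instance addresses here are made up:

import itertools

registry = {
    "user-service": ["10.0.0.1:8080", "10.0.0.2:8080"],  # healthy instances only
}
cursors = {name: itertools.cycle(hosts) for name, hosts in registry.items()}

def discover(service_name):
    return next(cursors[service_name])  # round-robin across instances

print(discover("user-service"))  # 10.0.0.1:8080
print(discover("user-service"))  # 10.0.0.2:8080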

Resilience Patterns

Circuit Breaker

stateDiagram-v2
    [*] --> CLOSED : Normal
    CLOSED --> OPEN : N failures in window
    OPEN --> HALF_OPEN : Timeout expires
    HALF_OPEN --> CLOSED : Test request ✅
    HALF_OPEN --> OPEN : Test request ❌

CLOSED:     Requests pass through normally
OPEN:       Requests fail fast — no call to the failing service
HALF_OPEN:  Allow one test request to check recovery

Benefits:
  ✅ Prevents cascading failures
  ✅ Gives failing service time to recover
  ✅ Fast failure (no waiting for timeout)

Libraries: Resilience4j (Java), Polly (.NET), Hystrix (deprecated)
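
A minimal circuit-breaker sketch mirroring the three states above. The threshold and timeout values are illustrative; the libraries listed handle sliding windows, metrics, and concurrency properly:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"  # timeout expired: allow one test request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"  # trip (or re-trip) the breaker
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "CLOSED"  # success closes the circuit
        return result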

Retry with Exponential Backoff

import random
import time

class TransientError(Exception):
    """Retryable failure (e.g. timeout, 503). Placeholder for this sketch."""

def call_with_retry(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the last failure
            wait = (2 ** attempt) + random.uniform(0, 1)  # exponential backoff + jitter
            time.sleep(wait)
    # Waits between attempts: 1s, 2s, 4s, 8s (+ jitter); the 5th failure raises

Bulkhead Pattern

Isolate thread pools per downstream service.
If Service A is slow and exhausts its pool, Service B/C are unaffected.

Thread Pool Bulkhead:
  Service A calls: pool of 20 threads
  Service B calls: pool of 20 threads
  Service C calls: pool of 20 threads

Semaphore Bulkhead:
  Limit concurrent calls to N (simpler, same thread)

Key insight: Don't let one slow dependency exhaust shared resources.
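
A semaphore-bulkhead sketch of that insight: cap concurrent calls per downstream so a slow dependency rejects excess load instead of hogging shared capacity (pool sizes and service names are illustrative):

import threading

bulkheads = {
    "service-a": threading.Semaphore(20),  # max 20 in-flight calls each
    "service-b": threading.Semaphore(20),
}

def call_with_bulkhead(service, fn):
    sem = bulkheads[service]
    if not sem.acquire(blocking=False):
        raise RuntimeError(f"{service} bulkhead full: rejecting call")
    try:
        return fn()
    finally:
        sem.release()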

Distributed Data Patterns

Saga Pattern (Distributed Transactions)

Problem: Distributed transactions across services (2PC blocks participants and scales poorly)

Saga = sequence of local transactions with compensating transactions on failure

Choreography-based Saga:
  Order Service → creates order → publishes "order.created"
  Payment Service → charges card → publishes "payment.done"
  Inventory Service → reserves stock → publishes "stock.reserved"
  Shipping Service → schedules delivery
  On failure → each service publishes compensation event
  (e.g., "payment.refund" if inventory fails)

Orchestration-based Saga:
  Central Saga Orchestrator tells each service what to do
  Orchestrator tracks state and handles compensation
  Pro: Easier to follow and debug. Con: Orchestrator adds central coupling and is a potential single point of failure.
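
A minimal orchestration-saga sketch: run each local transaction in order and, on failure, run the compensations for the completed steps in reverse. The step actions are print placeholders:

def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):  # compensate in reverse order
                comp()
            raise

run_saga([
    (lambda: print("create order"),  lambda: print("cancel order")),
    (lambda: print("charge card"),   lambda: print("refund payment")),
    (lambda: print("reserve stock"), lambda: print("release stock")),
])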

CQRS (Command Query Responsibility Segregation)

Separate read model from write model.

Write side (Command):
  Optimised for consistency
  Normalised data, ACID transactions
  Slower queries OK

Read side (Query):
  Optimised for performance
  Denormalised, pre-aggregated views
  Multiple read models for different use cases

Sync via: Event sourcing, CDC (Change Data Capture), Kafka

Example:
  Write: UPDATE orders table (normalised)
  Read:  Pre-computed "user's order history" view in Redis / Elasticsearch
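
A minimal CQRS sketch: a normalised write model plus a denormalised read model kept in sync by a projection function (an in-process stand-in for Kafka or CDC; all names are illustrative):

orders = {}         # write model: order_id -> normalised order row
order_history = {}  # read model: user_id -> pre-built list of summaries

def place_order(order_id, user_id, total):
    orders[order_id] = {"user_id": user_id, "total": total}  # command side
    project({"order_id": order_id, "user_id": user_id, "total": total})

def project(event):
    # Query side: update the denormalised view so reads need no joins.
    order_history.setdefault(event["user_id"], []).append(
        {"order_id": event["order_id"], "total": event["total"]})

place_order(1, "alice", 99)
print(order_history["alice"])  # fast read from the pre-computed view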

Outbox Pattern

Problem:
  1. Write to DB ✅
  2. Publish to Kafka ❌ (crash) → message lost!

Outbox Solution:
  1. Write to DB + write to outbox_events table (same transaction ✅)
  2. Background worker reads outbox → publishes to Kafka
  3. Mark outbox record as published

Guarantees: At-least-once delivery
Requirement: Make consumers idempotent (safe to process twice)

Tools: Debezium (CDC) can watch the outbox table automatically
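
A sketch of the write path and worker using sqlite3 (table names and columns are assumptions for illustration). The key point is that both inserts share one transaction:

import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox_events (id INTEGER PRIMARY KEY, "
           "payload TEXT, published INTEGER DEFAULT 0)")

with db:  # one transaction: both rows commit, or neither does
    db.execute("INSERT INTO orders (total) VALUES (?)", (99.0,))
    db.execute("INSERT INTO outbox_events (payload) VALUES (?)",
               (json.dumps({"type": "order.created", "total": 99.0}),))

# Background worker: read unpublished events, publish, mark as published.
rows = db.execute(
    "SELECT id, payload FROM outbox_events WHERE published = 0").fetchall()
for event_id, payload in rows:
    print("publish to broker:", payload)  # stand-in for a Kafka producer send
    db.execute("UPDATE outbox_events SET published = 1 WHERE id = ?",
               (event_id,))
db.commit()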

Event Sourcing

Instead of storing current state → store sequence of events

Traditional: users table { id, name, balance = 500 }

Event Sourcing:
  Event 1: AccountOpened { id: 1, initial: 1000 }
  Event 2: MoneyWithdrawn { id: 1, amount: 200 }
  Event 3: MoneyWithdrawn { id: 1, amount: 300 }
  Current balance: 1000 - 200 - 300 = 500

Benefits:
  ✅ Full audit trail built-in
  ✅ Replay events to rebuild any past state
  ✅ Natural fit for CQRS

Drawbacks:
  ❌ Querying current state requires replaying events (use snapshots)
  ❌ Schema evolution of events is hard
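
A sketch of replaying the event log above to rebuild current state (a fold over the events; snapshots would periodically cache this result):

events = [
    {"type": "AccountOpened",  "initial": 1000},
    {"type": "MoneyWithdrawn", "amount": 200},
    {"type": "MoneyWithdrawn", "amount": 300},
]

def replay(events):
    balance = 0
    for e in events:  # apply each event in order
        if e["type"] == "AccountOpened":
            balance = e["initial"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
        elif e["type"] == "MoneyDeposited":
            balance += e["amount"]
    return balance

print(replay(events))  # 500 -- current state, with full history preserved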

Distributed Tracing

How It Works

Every request gets a unique trace_id
Each service hop creates a span with span_id + parent span_id

Request → API Gateway (trace: abc, span: 1)
            → User Service   (trace: abc, span: 2, parent: 1)
                → DB query   (trace: abc, span: 3, parent: 2)
            → Order Service  (trace: abc, span: 4, parent: 1)

Trace aggregator collects all spans → builds flame chart
→ Instantly see which service is the bottleneck

Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM, OpenTelemetry
Standard: OpenTelemetry (vendor-neutral)
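
A nested-span sketch using the OpenTelemetry Python SDK (pip packages opentelemetry-api and opentelemetry-sdk; span names are illustrative). The child span automatically inherits the trace_id and records its parent span_id:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-demo")

with tracer.start_as_current_span("api-gateway"):        # root span
    with tracer.start_as_current_span("user-service"):   # child: same trace, parent = root
        pass  # spans are printed to the console on exit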

Useful Links & Resources