About This Page

Covers database design at scale — choosing DBs, replication, sharding, indexing, and ACID. Parent: System Design. See also: System Design - Caching, System Design - Scalability & CAP.

SQL vs NoSQL

Comparison

| Aspect | SQL (Relational) | NoSQL |
|---|---|---|
| Structure | Tables, fixed schema | Document, KV, Column, Graph |
| ACID | Full ACID transactions | Eventual consistency (mostly) |
| Scaling | Vertical (primarily) | Horizontal (built-in) |
| Joins | Rich multi-table joins | Limited or none |
| Query | SQL — flexible, powerful | Optimised per access pattern |
| Use when | Complex relationships, financial | High write throughput, flexible schema |
| Examples | PostgreSQL, MySQL, SQL Server | MongoDB, Cassandra, DynamoDB, Redis |

NoSQL Types

| Type | Structure | Best For | Examples |
|---|---|---|---|
| Key-Value | key → value | Sessions, caching, simple lookups | Redis, DynamoDB |
| Document | JSON/BSON documents | Flexible schema, CMS, user profiles | MongoDB, CouchDB |
| Column-family | Columns grouped by family | Time-series, analytics, wide rows | Cassandra, HBase |
| Graph | Nodes + edges | Social networks, recommendations | Neo4j, Amazon Neptune |
| Time-Series | Timestamped data points | Metrics, IoT, monitoring | InfluxDB, TimescaleDB |

How to Choose

Use SQL when:
  ✅ Data has complex relationships (joins needed)
  ✅ Strong consistency is required (financial, e-commerce)
  ✅ Schema is stable and well-defined
  ✅ Reporting / analytics queries needed

Use NoSQL when:
  ✅ Need horizontal scale with massive write throughput
  ✅ Data is unstructured or schema evolves frequently
  ✅ Low-latency key-value lookups (caching, sessions)
  ✅ Time-series or event streams

Don't choose NoSQL just because it's "modern".
PostgreSQL handles billions of rows with proper indexing + sharding.

Database Replication

Primary-Replica (Master-Slave)

flowchart LR
    App -->|writes| Primary["Primary DB\n(Writes)"]
    App -->|reads| R1["Replica 1\n(Reads)"]
    App -->|reads| R2["Replica 2\n(Reads)"]
    Primary -->|replicate| R1
    Primary -->|replicate| R2

Benefits:
  ✅ Read scalability — distribute reads across replicas
  ✅ Failover — promote replica if primary fails
  ✅ Backups — take snapshots from replica (no impact to primary)

Replication lag:
  Async replication → replicas may be seconds behind
  Sync replication  → consistent but slower writes
  Semi-sync         → wait for ≥1 replica (balance)
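
The semi-sync mode above can be sketched with `asyncio`: commit locally, fan out to replicas, and acknowledge the client as soon as at least one replica has applied the write. The delays and list-based "replicas" here are stand-ins for real network replication, not a production pattern.

```python
import asyncio
import random

async def replicate(replica: list, record: dict) -> None:
    # simulated network + apply delay on the replica
    await asyncio.sleep(random.uniform(0.01, 0.05))
    replica.append(record)

async def semi_sync_write(primary: list, replicas: list, record: dict):
    primary.append(record)  # 1. commit locally on the primary
    tasks = [asyncio.create_task(replicate(r, record)) for r in replicas]
    # 2. semi-sync: ack the client once >= 1 replica has applied the write
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    return pending          # stragglers keep replicating in the background

async def demo():
    primary, r1, r2 = [], [], []
    pending = await semi_sync_write(primary, [r1, r2], {"id": 1})
    acked = sum(1 for r in (r1, r2) if r)  # replicas caught up at ack time
    await asyncio.gather(*pending)         # lagging replicas finish eventually
    return primary, r1, r2, acked

primary, r1, r2, acked = asyncio.run(demo())
```

Fully async replication would skip the `asyncio.wait` entirely; fully sync would wait for *all* tasks before acknowledging.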

Primary-Primary (Multi-Master)

Both nodes accept reads AND writes.
Sync via bidirectional replication.

Pros: High write availability, geographic distribution
Cons: Conflict resolution complexity (last-write-wins? custom?)

Use when: Multiple datacenters, each needing local writes

Database Sharding

Sharding Strategies

| Strategy | Key | Hot Spots? | Range Queries? | Resharding? |
|---|---|---|---|---|
| Range | user_id ranges | Yes (popular ranges) | ✅ Easy | Hard |
| Hash | hash(id) % N | No | ❌ Hard | Very Hard |
| Directory | Lookup table | No | Depends | Easy |
| Geographic | Region/country | Maybe | Depends | Medium |

Consistent Hashing for Shards

Problem: hash(key) % N breaks when N changes

Solution: Consistent hashing ring
  → Add server: only remap 1/N keys
  → Remove server: only 1/N keys move to neighbours
  → Used in: Cassandra, DynamoDB, Redis Cluster
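
A minimal ring with virtual nodes can be sketched in Python; the hash function, vnode count, and shard names are illustrative choices, not what Cassandra or Redis Cluster actually use:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []          # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        # each physical node owns `vnodes` points on the ring
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key: str) -> str:
        # a key belongs to the first ring point clockwise from its hash
        h = self._hash(key)
        i = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[i][1]
```

Adding a shard only pulls keys onto the *new* node; every other key keeps its old owner. Virtual nodes smooth out the otherwise lumpy key distribution.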

See [[System Design - Scalability & CAP]] for full explanation.

Sharding Challenges

Cross-shard joins:     Application-level join (expensive)
Cross-shard transactions: 2PC or Saga pattern
Resharding:           Move data while serving traffic (blue-green)
Hot shards:           Consistent hashing + virtual nodes
ID generation:        Snowflake ID / UUID (no auto-increment)
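
The Snowflake scheme mentioned above packs a timestamp, machine ID, and per-millisecond sequence into one 64-bit integer, giving unique, roughly time-ordered IDs without a central counter. The epoch and the 41/10/12 bit split below follow the common layout, but the class itself is a sketch (real implementations need careful clock-drift handling):

```python
import threading
import time

class SnowflakeID:
    """Snowflake-style ID: 41-bit timestamp | 10-bit machine | 12-bit sequence."""

    EPOCH = 1_600_000_000_000  # custom epoch in ms (illustrative)

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                now = self.last_ms          # tolerate small clock regressions
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:      # sequence exhausted: wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs sort by creation time — which also keeps B-tree index inserts append-mostly, unlike random UUIDs.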

Database Indexing

Index Types

| Index Type | How | Best For | Avoid When |
|---|---|---|---|
| B-Tree (default) | Balanced tree | Range queries, ORDER BY, equality | High-cardinality equality only |
| Hash | Hash table | Exact match O(1) | Range queries |
| Composite | Multiple columns together | Multi-column WHERE clauses | Wrong column order |
| Covering | Index includes all queried columns | Avoid table lookup entirely | Write-heavy tables |
| Full-Text | Inverted index | LIKE '%keyword%' search | Structured data |
| Partial | Index subset of rows | status = 'active' queries | Rarely filtered columns |

Index Best Practices

DO:
  ✅ Index columns in WHERE, JOIN ON, ORDER BY
  ✅ Use composite index — most selective column first
  ✅ Use covering index for hot read paths
  ✅ EXPLAIN / EXPLAIN ANALYZE to verify index is used

DON'T:
  ❌ Index every column — writes become slow
  ❌ Index low-cardinality columns (gender: M/F) — not selective
  ❌ Ignore index bloat — VACUUM / ANALYZE periodically
  ❌ Violate the left-prefix rule: INDEX(a,b,c) → WHERE b=? can't use the index

Composite index left-prefix rule:
  INDEX (last_name, first_name, age)
  ✅ WHERE last_name = ?
  ✅ WHERE last_name = ? AND first_name = ?
  ❌ WHERE first_name = ?          -- skips left column
  ❌ WHERE age = ?                 -- skips left two columns
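
The left-prefix rule is easy to verify with `EXPLAIN QUERY PLAN` — here against an in-memory SQLite database (table, index, and column names are made up for the demo; the same check works with `EXPLAIN` in PostgreSQL/MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (last_name TEXT, first_name TEXT, age INT, email TEXT)"
)
conn.execute("CREATE INDEX idx_name ON users (last_name, first_name, age)")

def plan(where: str) -> str:
    # Return the human-readable plan detail for a query with this WHERE clause.
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM users WHERE {where}"
    ).fetchall()
    return " ".join(row[-1] for row in rows)

print(plan("last_name = 'smith'"))                      # uses idx_name
print(plan("last_name = 'smith' AND first_name = 'a'")) # uses idx_name
print(plan("first_name = 'a'"))                         # left column skipped
```

Queries that constrain the leading column report a SEARCH via `idx_name`; the `first_name`-only query falls back to a table scan.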

ACID Transactions

Properties

| Property | Meaning | Example |
|---|---|---|
| Atomicity | All-or-nothing. Partial failure = full rollback. | Bank transfer: debit + credit both succeed or neither. |
| Consistency | DB moves valid state → valid state. Constraints hold. | Balance can't go negative if constraint exists. |
| Isolation | Concurrent transactions don't interfere. | Two users booking last seat — only one wins. |
| Durability | Committed data survives crashes. | Power outage after commit — data still there. |
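
Atomicity and consistency can be demonstrated with the bank-transfer example from the table, using SQLite and a `CHECK` constraint (the schema is invented for the demo; `with conn:` commits on success and rolls back on any exception):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(src: str, dst: str, amount: int) -> bool:
    try:
        with conn:  # one transaction: commit on success, rollback on exception
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # if this debit violates CHECK (balance >= 0), the credit above
            # is rolled back too -- all-or-nothing
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances() -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

An overdraft attempt leaves both balances exactly as they were — the credit that already ran is undone by the rollback.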

Isolation Levels

| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| READ UNCOMMITTED | ✅ Possible | ✅ Possible | ✅ Possible | Fastest |
| READ COMMITTED | ❌ No | ✅ Possible | ✅ Possible | Fast |
| REPEATABLE READ | ❌ No | ❌ No | ✅ Possible | Medium |
| SERIALIZABLE | ❌ No | ❌ No | ❌ No | Slowest |

Connection Pooling

Why Pooling?

Opening a DB connection = ~50–100ms overhead

Without pool:
  Request → Open connection → Query → Close → ~100ms wasted

With pool:
  App starts → Pre-open 10–100 connections
  Request → Borrow connection → Query → Return to pool
  → Near 0ms overhead

Popular poolers:
  PgBouncer        — PostgreSQL connection pooler
  HikariCP (Java)  — Fastest JVM pool
  SQLAlchemy pool  — Python
  pgpool-II        — Load balancing + pooling for PostgreSQL

Key Pool Settings

min_connections:    5   (always-open minimum)
max_connections:  100   (upper limit — don't exceed DB's max_connections)
connection_timeout: 30s  (how long to wait for available connection)
idle_timeout:      600s  (close idle connections after this)
max_lifetime:     1800s  (recycle connections to prevent stale state)
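
The settings above map directly onto a pool's internals. A minimal sketch (SQLite standing in for a real database; production poolers like HikariCP add health checks, `idle_timeout`, and `max_lifetime` recycling on top of this):

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal connection pool sketch: min/max sizing + borrow timeout."""

    def __init__(self, dsn: str, min_conns: int = 5, max_conns: int = 100,
                 timeout: float = 30.0):
        self.dsn = dsn
        self.timeout = timeout          # connection_timeout
        self.max_conns = max_conns      # never exceed the DB's max_connections
        self.total_opened = 0
        self._idle = queue.Queue(maxsize=max_conns)
        for _ in range(min_conns):      # always-open minimum
            self._idle.put(self._open())

    def _open(self) -> sqlite3.Connection:
        self.total_opened += 1
        return sqlite3.connect(self.dsn)

    def acquire(self) -> sqlite3.Connection:
        try:
            return self._idle.get_nowait()   # fast path: reuse an idle conn
        except queue.Empty:
            if self.total_opened < self.max_conns:
                return self._open()          # grow up to max_conns
            # pool exhausted: block up to `timeout` waiting for a release
            return self._idle.get(timeout=self.timeout)

    def release(self, conn: sqlite3.Connection) -> None:
        self._idle.put(conn)
```

Borrowing from the queue is what makes repeat requests near-zero-cost: the TCP/auth handshake happened once, at pool fill time.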

Read/Write Splitting

Pattern

# Application-level routing example
def get_db_connection(query_type: str):
    if query_type == "write":
        return connect(PRIMARY_DB_HOST)
    return connect(REPLICA_DB_HOST)  # round-robin if multiple replicas

# ORM-level (SQLAlchemy)
from sqlalchemy import create_engine, select

write_engine = create_engine(PRIMARY_URL)
read_engine = create_engine(REPLICA_URL)

with read_engine.connect() as conn:
    result = conn.execute(select(User))  # goes to replica

Trade-offs

Replication lag risk:
  Write → primary → replicate → replica (takes ms to seconds)
  If user reads immediately after writing → may see stale data

Mitigation:
  Read-your-own-writes: route reads to primary for 1s after write
  Sticky sessions: same user always reads from same replica
  Sync replication: higher latency but no lag
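
The read-your-own-writes mitigation can be sketched as a routing function that pins a user's reads to the primary for a short window after their last write. The window length, names, and injectable `now` clock are illustrative:

```python
import time

PRIMARY, REPLICA = "primary", "replica"
RYW_WINDOW = 1.0  # pin reads to the primary for 1s after a write (tunable)

_last_write = {}  # user_id -> timestamp of that user's last write

def route(user_id: str, query_type: str, now=None) -> str:
    # `now` is injectable for testing; defaults to the monotonic clock
    now = time.monotonic() if now is None else now
    if query_type == "write":
        _last_write[user_id] = now
        return PRIMARY
    # read-your-own-writes: recent writers must not see replication lag
    if now - _last_write.get(user_id, float("-inf")) < RYW_WINDOW:
        return PRIMARY
    return REPLICA
```

Only the writing user pays the primary-read cost, and only briefly — everyone else keeps the read-scaling benefit of the replicas.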

Useful Links & Resources