Databases and Data Modeling
For the first few years of my career, I treated databases as magic boxes. I wrote ORM queries, got data back, and never thought about what happened inside. Then I shipped a feature that worked perfectly with 100 rows and collapsed at 100,000—a 6-second page load caused by a missing index and an accidental N+1 query. Fixing it took 10 minutes once I understood what was happening. Not understanding it cost me an afternoon of confusion.
Databases are the backbone of almost every application. Understanding how they store, retrieve, and protect your data isn't just for DBAs—it's essential for any engineer who writes queries, designs schemas, or builds APIs.
Relational Fundamentals
Relational databases (PostgreSQL, MySQL, SQLite) organize data into tables (relations). Each table has columns (attributes) and rows (tuples). The power of the relational model is that relationships between tables are expressed through shared values, not embedded structures.
```sql
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  email TEXT UNIQUE NOT NULL,
  name TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  total DECIMAL(10, 2) NOT NULL,
  status TEXT DEFAULT 'pending',
  created_at TIMESTAMPTZ DEFAULT now()
);
```

Keys
- Primary key: Uniquely identifies each row. Usually an auto-incrementing integer or a UUID.
- Foreign key: References a primary key in another table. `orders.user_id` references `users.id`. The database enforces this—you can't insert an order for a user that doesn't exist.
- Unique key: Ensures no two rows have the same value in a column (`email UNIQUE`).
Normalization
Normalization is the process of organizing data to reduce redundancy and prevent anomalies.
First Normal Form (1NF): Each column holds a single value (no arrays, no comma-separated lists).
```sql
-- Bad: violates 1NF
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name TEXT,
  phone_numbers TEXT -- '555-0100,555-0101,555-0102'
);

-- Good: separate table, one number per row
CREATE TABLE phone_numbers (
  id SERIAL PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  number TEXT NOT NULL
);
```

Second Normal Form (2NF): Every non-key column depends on the entire primary key (relevant for composite keys).
Third Normal Form (3NF): No non-key column depends on another non-key column. If city determines state, store them in a separate cities table.
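The update anomaly that 3NF prevents can be sketched in plain JavaScript with hypothetical data: when city and state are duplicated on every row, a partial update leaves the data contradicting itself.

```javascript
// Denormalized: state is repeated on every row that shares a city.
const users = [
  { name: "Alice", city: "Portland", state: "OR" },
  { name: "Bob", city: "Portland", state: "OR" },
];

// Updating only one row leaves the two Portland rows disagreeing.
users[0].state = "ME";
const states = new Set(users.map((u) => u.state));
console.log(states.size); // 2: same city, two different states

// Normalized: the city-to-state fact lives in exactly one place.
const cities = { Portland: "OR" };
const normalizedUsers = [
  { name: "Alice", city: "Portland" },
  { name: "Bob", city: "Portland" },
];
cities["Portland"] = "ME"; // one update fixes every user at once
const resolved = normalizedUsers.map((u) => cities[u.city]);
console.log(resolved); // [ 'ME', 'ME' ]
```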
In practice: Normalize by default, denormalize when you have a measured performance need. Over-normalization leads to excessive joins. Under-normalization leads to data inconsistencies.
Joins
Joins combine rows from two or more tables based on a related column.
```sql
-- INNER JOIN: only matching rows from both tables
SELECT u.name, o.total, o.status
FROM users u
INNER JOIN orders o ON o.user_id = u.id;

-- LEFT JOIN: all users, even those with no orders (NULL for order columns)
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;
```

| Join type | Returns |
|---|---|
| INNER JOIN | Only rows with matches in both tables |
| LEFT JOIN | All rows from left table, matched rows from right (NULL if no match) |
| RIGHT JOIN | All rows from right table, matched rows from left |
| FULL OUTER JOIN | All rows from both tables |
| CROSS JOIN | Every combination (cartesian product) |
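The semantics in the table can be simulated over in-memory arrays (hypothetical data; a real database does this far more efficiently, but the row-matching logic is the same):

```javascript
const users = [
  { id: 1, name: "Alice" },
  { id: 2, name: "Bob" }, // Bob has no orders
];
const orders = [{ id: 10, user_id: 1, total: 59.99 }];

// INNER JOIN: keep only user/order pairs that match.
const inner = users.flatMap((u) =>
  orders
    .filter((o) => o.user_id === u.id)
    .map((o) => ({ name: u.name, total: o.total }))
);

// LEFT JOIN: keep every user; use null where no order matches.
const left = users.flatMap((u) => {
  const matched = orders.filter((o) => o.user_id === u.id);
  return matched.length
    ? matched.map((o) => ({ name: u.name, total: o.total }))
    : [{ name: u.name, total: null }];
});

console.log(inner.length); // 1 (only Alice has a match)
console.log(left.length); // 2 (Bob appears with total: null)
```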
The N+1 problem: The most common performance mistake with ORMs.
```js
// N+1: 1 query for users + N queries for orders (one per user)
const users = await db.query("SELECT * FROM users")
for (const user of users) {
  user.orders = await db.query("SELECT * FROM orders WHERE user_id = $1", [
    user.id,
  ])
}

// Fix: 1 query with a join (or 2 queries with an IN clause)
const usersWithOrders = await db.query(`
  SELECT u.*, json_agg(o.*) AS orders
  FROM users u
  LEFT JOIN orders o ON o.user_id = u.id
  GROUP BY u.id
`)
```

If you have 1,000 users, the N+1 version fires 1,001 queries. The join version fires 1. The database is fast at joining—let it do the work.
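The two-query alternative fetches all users, then all of their orders in one second query, and stitches them together in memory. This grouping step is essentially what an ORM's eager loading does under the hood; here it is sketched with in-memory stand-ins for the two result sets:

```javascript
// Pretend these came back from two queries:
//   SELECT * FROM users;
//   SELECT * FROM orders WHERE user_id = ANY($1);  -- $1 = [1, 2]
const users = [
  { id: 1, name: "Alice" },
  { id: 2, name: "Bob" },
];
const orders = [
  { id: 10, user_id: 1, total: 59.99 },
  { id: 11, user_id: 1, total: 124.0 },
];

// Group orders by user_id once: O(users + orders), not O(users * orders).
const byUser = new Map();
for (const o of orders) {
  if (!byUser.has(o.user_id)) byUser.set(o.user_id, []);
  byUser.get(o.user_id).push(o);
}
for (const u of users) u.orders = byUser.get(u.id) ?? [];

console.log(users[0].orders.length); // 2
console.log(users[1].orders.length); // 0
```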
Indexing
An index is a data structure (typically a B-tree) that the database builds on one or more columns to speed up lookups. Without an index, the database must scan every row in the table (full table scan). With an index, it can jump directly to the relevant rows.
```sql
-- Without an index: O(n) full table scan
SELECT * FROM orders WHERE user_id = 42;

-- Create an index
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- With the index: O(log n) B-tree lookup
SELECT * FROM orders WHERE user_id = 42;
```

How B-Tree Indexes Work
A B-tree is a balanced tree where each node contains many keys and pointers. A B-tree index on user_id looks roughly like:
```
          [50 | 100]
         /     |     \
[10|25|40]  [60|75]  [110|150|200]
     ↓         ↓          ↓
    (data pages with actual rows)
```
To find user_id = 75, the database starts at the root, navigates to the middle child, and finds the leaf page containing that row. For a million rows, this takes ~3-4 page reads instead of scanning all pages.
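The logarithmic behavior can be felt with a plain binary search over sorted keys, a rough stand-in for walking a B-tree (a real B-tree has a high fan-out per node, so its depth is much smaller than even log2 n):

```javascript
// Binary search over sorted keys: each step halves the search space.
function search(sortedKeys, target) {
  let lo = 0,
    hi = sortedKeys.length - 1,
    steps = 0;
  while (lo <= hi) {
    steps++;
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] === target) return { found: true, steps };
    if (sortedKeys[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return { found: false, steps };
}

// One million sorted ids: found in at most ~20 comparisons, not 1,000,000.
const keys = Array.from({ length: 1_000_000 }, (_, i) => i);
const result = search(keys, 755_000);
console.log(result.found, result.steps <= 20); // true true
```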
When to Create Indexes
- Columns used in WHERE clauses (especially equality and range comparisons).
- Columns used in JOIN conditions (foreign keys).
- Columns used in ORDER BY (the index provides sorted order for free).
When NOT to Create Indexes
- Write-heavy tables with few reads. Every INSERT/UPDATE must also update the index.
- Low-cardinality columns (e.g., a `status` column with 3 possible values). The index doesn't help much when most rows match.
- Small tables. Scanning 100 rows is faster than an index lookup.
Composite Indexes
An index on (user_id, created_at) helps queries that filter on both columns, or on user_id alone (leftmost prefix). It does NOT help queries that only filter on created_at.
```sql
-- Uses the composite index
SELECT * FROM orders WHERE user_id = 42 AND created_at > '2025-01-01';
SELECT * FROM orders WHERE user_id = 42;

-- Does NOT use the composite index (missing leftmost column)
SELECT * FROM orders WHERE created_at > '2025-01-01';
```

Transactions and Isolation Levels
A transaction groups multiple operations into an atomic unit. Either all operations succeed, or none do.
```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```

If the server crashes between the two updates, the transaction is rolled back. No money disappears. This is the A in ACID (Atomicity, Consistency, Isolation, Durability).
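A toy sketch of atomicity, with hypothetical in-memory accounts: apply all changes to a working copy, and publish the copy only if every step succeeds. A real database achieves this with a write-ahead log, but the all-or-nothing contract is the same.

```javascript
let accounts = { 1: 500, 2: 300 };

// Run fn against a copy; commit the copy only if fn completes without throwing.
function transact(fn) {
  const working = { ...accounts };
  try {
    fn(working);
    accounts = working; // COMMIT: all changes become visible at once
    return true;
  } catch {
    return false; // ROLLBACK: the copy is discarded, accounts is untouched
  }
}

// Transfer money between accounts, refusing to overdraw.
const transfer = (from, to, amount) => (acc) => {
  if (acc[from] < amount) throw new Error("insufficient funds");
  acc[from] -= amount;
  acc[to] += amount;
};

console.log(transact(transfer(1, 2, 100))); // true: both updates applied
console.log(accounts[1], accounts[2]); // 400 400
console.log(transact(transfer(1, 2, 9999))); // false: neither update applied
console.log(accounts[1], accounts[2]); // still 400 400
```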
Isolation Levels
When multiple transactions run concurrently, isolation levels determine what they can see of each other's uncommitted changes.
| Level | Dirty reads | Non-repeatable reads | Phantom reads |
|---|---|---|---|
| READ UNCOMMITTED | Yes | Yes | Yes |
| READ COMMITTED | No | Yes | Yes |
| REPEATABLE READ | No | No | Yes |
| SERIALIZABLE | No | No | No |
- Dirty read: You see another transaction's uncommitted changes (which might be rolled back).
- Non-repeatable read: You read the same row twice and get different values because another transaction committed a change.
- Phantom read: You run the same query twice and get different rows because another transaction inserted/deleted rows.
PostgreSQL defaults to READ COMMITTED. Most applications are fine with this. Use SERIALIZABLE when you need absolute correctness (financial calculations) and can handle the performance cost (more lock contention, potential transaction retries).
Query Planning and EXPLAIN
The query planner is the database's optimizer. It takes your SQL and decides the most efficient execution strategy: which indexes to use, what join algorithm to pick, whether to sort.
```sql
EXPLAIN ANALYZE
SELECT u.name, count(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.created_at > '2025-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 10;
```

The output shows:
- Seq Scan vs Index Scan: Sequential scan reads every row. Index scan uses an index. If you see Seq Scan on a large table where you expected an index, you're missing an index or the planner decided it wasn't worth it.
- Nested Loop vs Hash Join vs Merge Join: How the database combines tables. Hash joins are typically fastest for large tables. Nested loops are fine for small results with indexed lookups.
- Actual time: Real execution time in milliseconds.
- Rows: Estimated vs actual row counts. Large discrepancies indicate stale statistics (run `ANALYZE` on your tables).
```
Sort  (cost=234.56..234.78 rows=10)
  ->  HashAggregate  (cost=200.00..210.00 rows=100)
        ->  Hash Join  (cost=50.00..180.00 rows=5000)
              Hash Cond: (o.user_id = u.id)
              ->  Index Scan using idx_orders_created on orders o
                    Index Cond: (created_at > '2025-01-01')
              ->  Hash  (cost=25.00..25.00 rows=1000)
                    ->  Seq Scan on users u
```
Reading query plans is like reading a profiler output. It tells you where time is spent and what the database chose to do. Learn to read them and you'll fix slow queries in minutes instead of hours.
NoSQL Databases
Not everything fits neatly into tables. NoSQL databases trade some relational guarantees for flexibility, scalability, or performance in specific access patterns.
Document Stores (MongoDB, CouchDB)
Store data as flexible JSON-like documents. No fixed schema—each document can have different fields.
```json
{
  "_id": "user_123",
  "name": "Alice",
  "email": "alice@example.com",
  "orders": [
    { "total": 59.99, "status": "shipped" },
    { "total": 124.0, "status": "delivered" }
  ]
}
```

Good for: Rapid prototyping, content management, catalogs, data that varies per record. Embedding related data (orders inside the user document) eliminates joins for common access patterns.
Tradeoff: No joins, limited transactions (historically), denormalization leads to data duplication.
Key-Value Stores (Redis, DynamoDB)
The simplest model: a key maps to a value. Extremely fast for lookups by key.
```
"session:abc123" → {"userId": 42, "expires": "2025-12-31"}
"cache:user:42"  → {"name": "Alice", "role": "admin"}
```
Good for: Caching, session storage, rate limiting, leaderboards, pub/sub messaging.
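A minimal sketch of the key-value-with-expiry pattern that session stores and caches rely on (in-memory JavaScript; Redis gives you this natively with `SET key value EX seconds`):

```javascript
// Key-value store with per-key TTL, like a tiny Redis.
class TTLCache {
  constructor() {
    this.store = new Map();
  }
  set(key, value, ttlMs) {
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
  get(key) {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (Date.now() >= entry.expiresAt) {
      this.store.delete(key); // lazy expiration on read
      return null;
    }
    return entry.value;
  }
}

const cache = new TTLCache();
cache.set("session:abc123", { userId: 42 }, 60_000);
console.log(cache.get("session:abc123")); // { userId: 42 }
cache.set("session:old", { userId: 7 }, -1); // already expired
console.log(cache.get("session:old")); // null
```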
Column-Oriented (Cassandra, ClickHouse)
Store data by column instead of by row. Reading all values of a single column (e.g., all created_at timestamps) is fast because they're stored contiguously. Aggregations over large datasets are efficient.
Good for: Analytics, time-series data, write-heavy workloads with simple read patterns.
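The row-vs-column layout difference can be sketched in memory with toy data: a columnar layout keeps every value of one attribute in a single contiguous array, so a column aggregate never touches the other columns.

```javascript
// Row-oriented: each record carries every column.
const rows = [
  { id: 1, total: 59.99, status: "shipped" },
  { id: 2, total: 124.0, status: "delivered" },
  { id: 3, total: 15.5, status: "pending" },
];

// Column-oriented: one array per column, aligned by position.
const columns = {
  id: [1, 2, 3],
  total: [59.99, 124.0, 15.5],
  status: ["shipped", "delivered", "pending"],
};

// Summing `total` in the columnar layout scans one contiguous array;
// the row layout must walk every full record to reach each total.
const sumColumnar = columns.total.reduce((a, b) => a + b, 0);
const sumRows = rows.reduce((a, r) => a + r.total, 0);
console.log(sumColumnar === sumRows); // true, same answer either way
console.log(sumColumnar.toFixed(2)); // "199.49"
```

On disk the win is larger than this sketch suggests: contiguous same-typed values compress well and stream through the CPU cache, which is why analytics engines favor this layout.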
When to Use What
| Need | Choose |
|---|---|
| Complex queries, joins, transactions | Relational (PostgreSQL) |
| Flexible schema, embedded documents | Document (MongoDB) |
| Fast key lookups, caching | Key-value (Redis) |
| Large-scale analytics | Column-oriented (ClickHouse) |
| Graph relationships | Graph (Neo4j) |
Most applications should start with a relational database. It's the most flexible, well-understood, and well-tooled option. Add specialized databases when you have a specific need that relational doesn't serve well.
CAP Theorem
The CAP theorem states that a distributed database can provide at most two of three guarantees:
- Consistency: Every read returns the most recent write.
- Availability: Every request gets a response (no timeouts).
- Partition tolerance: The system works despite network failures between nodes.
Since network partitions are unavoidable in distributed systems, the real choice is between consistency and availability when a partition occurs:
- CP systems (PostgreSQL with synchronous replication, MongoDB with majority write concern): During a partition, some requests may fail to maintain consistency.
- AP systems (Cassandra, DynamoDB): During a partition, all nodes respond but may return stale data.
In practice, CAP is a spectrum, not a binary choice. Most systems let you tune the consistency level per operation. Understanding the tradeoff helps you choose the right configuration for each use case: financial transactions need CP, analytics dashboards can tolerate AP.
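One common way systems expose this spectrum is quorum reads and writes: with N replicas, a write waits for W acknowledgements and a read consults R replicas. If R + W > N, every read set overlaps every write set, so a read always sees the latest committed write. A sketch with hypothetical in-memory replicas:

```javascript
// Three replicas, each holding a versioned value per key.
const replicas = [new Map(), new Map(), new Map()];
const N = 3, W = 2, R = 2; // R + W > N guarantees read/write overlap

let version = 0;
function write(key, value) {
  version++;
  // Wait for W acks: here, write to the first W replicas.
  for (let i = 0; i < W; i++) replicas[i].set(key, { value, version });
}

function read(key) {
  // Consult R replicas (the last R, to force worst-case overlap)
  // and take the newest version seen.
  const seen = replicas
    .slice(N - R)
    .map((r) => r.get(key))
    .filter(Boolean);
  seen.sort((a, b) => b.version - a.version);
  return seen.length ? seen[0].value : null;
}

write("balance", 100);
write("balance", 250);
console.log(read("balance")); // 250: the overlapping replica has the latest write
```

Dialing W or R down trades consistency for availability and latency, which is exactly the per-operation tuning described above.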
The Pragmatic Takeaway
Databases are not magic boxes. They're engineering systems with specific behaviors, tradeoffs, and failure modes. The same query can take 5ms or 5 seconds depending on whether an index exists. The same schema can be elegant or nightmarish depending on normalization decisions.
Learn to read EXPLAIN ANALYZE output. Understand why your ORM generates the SQL it does. Know the difference between a hash join and a nested loop. These aren't DBA skills—they're application developer skills that prevent the most common class of production performance problems.
Start with PostgreSQL, learn it deeply, and reach for specialized databases only when you have a need that PostgreSQL genuinely can't serve. Most of the time, a well-designed relational schema with proper indexes outperforms the NoSQL alternative for general application workloads.