Databases and Data Modeling
For the first few years of my career, I treated databases as magic boxes. I wrote ORM queries, got data back, and never thought about what happened inside. Then I shipped a feature that worked perfectly with 100 rows and collapsed at 100,000—a 6-second page load caused by a missing index and an accidental N+1 query. Fixing it took 10 minutes once I understood what was happening. Not understanding it cost me an afternoon of confusion.
Databases are the backbone of almost every application. Understanding how they store, retrieve, and protect your data isn't just for DBAs—it's essential for any engineer who writes queries, designs schemas, or builds APIs.
Relational Fundamentals
Relational databases (PostgreSQL, MySQL, SQLite) organize data into tables (relations). Each table has columns (attributes) and rows (tuples). The power of the relational model is that relationships between tables are expressed through shared values, not embedded structures.
```sql
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  email TEXT UNIQUE NOT NULL,
  name TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  total DECIMAL(10, 2) NOT NULL,
  status TEXT DEFAULT 'pending',
  created_at TIMESTAMPTZ DEFAULT now()
);
```

Keys
- Primary key: Uniquely identifies each row. Usually an auto-incrementing integer or a UUID.
- Foreign key: References a primary key in another table. `orders.user_id` references `users.id`. The database enforces this—you can't insert an order for a user that doesn't exist.
- Unique key: Ensures no two rows have the same value in a column (`email UNIQUE`).
Normalization
Normalization is the process of organizing data to reduce redundancy and prevent anomalies.
First Normal Form (1NF): Each column holds a single value (no arrays, no comma-separated lists).
```sql
-- Bad: violates 1NF
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name TEXT,
  phone_numbers TEXT -- '555-0100,555-0101,555-0102'
);

-- Good: separate table, one number per row
CREATE TABLE phone_numbers (
  id SERIAL PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  number TEXT NOT NULL
);
```

Second Normal Form (2NF): Every non-key column depends on the entire primary key (relevant for composite keys).
Third Normal Form (3NF): No non-key column depends on another non-key column. If city determines state, store them in a separate cities table.
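The update anomaly that 3NF prevents can be sketched in plain JavaScript with hypothetical data: when city and state are duplicated on every row, a partial update leaves the data contradicting itself.

```javascript
// Denormalized: state is repeated on every row that shares a city.
const users = [
  { name: "Alice", city: "Portland", state: "OR" },
  { name: "Bob", city: "Portland", state: "OR" },
];

// Updating only one row leaves the two Portland rows disagreeing.
users[0].state = "ME";
const states = new Set(users.map((u) => u.state));
console.log(states.size); // 2: same city, two different states

// Normalized: the city-to-state fact lives in exactly one place.
const cities = { Portland: "OR" };
const normalizedUsers = [
  { name: "Alice", city: "Portland" },
  { name: "Bob", city: "Portland" },
];
cities["Portland"] = "ME"; // one update fixes every user at once
const resolved = normalizedUsers.map((u) => cities[u.city]);
console.log(resolved); // [ 'ME', 'ME' ]
```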
In practice: Normalize by default, denormalize when you have a measured performance need. Over-normalization leads to excessive joins. Under-normalization leads to data inconsistencies.
Joins
Joins combine rows from two or more tables based on a related column.
```sql
-- INNER JOIN: only matching rows from both tables
SELECT u.name, o.total, o.status
FROM users u
INNER JOIN orders o ON o.user_id = u.id;

-- LEFT JOIN: all users, even those with no orders (NULL for order columns)
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;
```

| Join type | Returns |
|---|---|
| INNER JOIN | Only rows with matches in both tables |
| LEFT JOIN | All rows from left table, matched rows from right (NULL if no match) |
| RIGHT JOIN | All rows from right table, matched rows from left |
| FULL OUTER JOIN | All rows from both tables |
| CROSS JOIN | Every combination (cartesian product) |
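The semantics in the table can be simulated over in-memory arrays (hypothetical data; a real database does this far more efficiently, but the row-matching logic is the same):

```javascript
const users = [
  { id: 1, name: "Alice" },
  { id: 2, name: "Bob" }, // Bob has no orders
];
const orders = [{ id: 10, user_id: 1, total: 59.99 }];

// INNER JOIN: keep only user/order pairs that match.
const inner = users.flatMap((u) =>
  orders
    .filter((o) => o.user_id === u.id)
    .map((o) => ({ name: u.name, total: o.total }))
);

// LEFT JOIN: keep every user; use null where no order matches.
const left = users.flatMap((u) => {
  const matched = orders.filter((o) => o.user_id === u.id);
  return matched.length
    ? matched.map((o) => ({ name: u.name, total: o.total }))
    : [{ name: u.name, total: null }];
});

console.log(inner.length); // 1 (only Alice has a match)
console.log(left.length); // 2 (Bob appears with total: null)
```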
The N+1 problem: The most common performance mistake with ORMs.
```js
// N+1: 1 query for users + N queries for orders (one per user)
const users = await db.query("SELECT * FROM users")
for (const user of users) {
  user.orders = await db.query("SELECT * FROM orders WHERE user_id = $1", [
    user.id,
  ])
}

// Fix: 1 query with a join (or 2 queries with an IN clause)
const usersWithOrders = await db.query(`
  SELECT u.*, json_agg(o.*) AS orders
  FROM users u
  LEFT JOIN orders o ON o.user_id = u.id
  GROUP BY u.id
`)
```

If you have 1,000 users, the N+1 version fires 1,001 queries. The join version fires 1. The database is fast at joining—let it do the work.
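The two-query alternative fetches all users, then all of their orders in one second query, and stitches them together in memory. This grouping step is essentially what an ORM's eager loading does under the hood; here it is sketched with in-memory stand-ins for the two result sets:

```javascript
// Pretend these came back from two queries:
//   SELECT * FROM users;
//   SELECT * FROM orders WHERE user_id = ANY($1);  -- $1 = [1, 2]
const users = [
  { id: 1, name: "Alice" },
  { id: 2, name: "Bob" },
];
const orders = [
  { id: 10, user_id: 1, total: 59.99 },
  { id: 11, user_id: 1, total: 124.0 },
];

// Group orders by user_id once: O(users + orders), not O(users * orders).
const byUser = new Map();
for (const o of orders) {
  if (!byUser.has(o.user_id)) byUser.set(o.user_id, []);
  byUser.get(o.user_id).push(o);
}
for (const u of users) u.orders = byUser.get(u.id) ?? [];

console.log(users[0].orders.length); // 2
console.log(users[1].orders.length); // 0
```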
Indexing
An index is a data structure (typically a B-tree) that the database builds on one or more columns to speed up lookups. Without an index, the database must scan every row in the table (full table scan). With an index, it can jump directly to the relevant rows.
```sql
-- Without an index: O(n) full table scan
SELECT * FROM orders WHERE user_id = 42;

-- Create an index
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- With the index: O(log n) B-tree lookup
SELECT * FROM orders WHERE user_id = 42;
```

How B-Tree Indexes Work
A B-tree is a balanced tree where each node contains many keys and pointers. A B-tree index on user_id looks roughly like:
```
          [50 | 100]
         /     |     \
[10|25|40]  [60|75]  [110|150|200]
     ↓         ↓          ↓
    (data pages with actual rows)
```
To find user_id = 75, the database starts at the root, navigates to the middle child, and finds the leaf page containing that row. For a million rows, this takes ~3-4 page reads instead of scanning all pages.
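The logarithmic behavior can be felt with a plain binary search over sorted keys, a rough stand-in for walking a B-tree (a real B-tree has a high fan-out per node, so its depth is much smaller than even log2 n):

```javascript
// Binary search over sorted keys: each step halves the search space.
function search(sortedKeys, target) {
  let lo = 0,
    hi = sortedKeys.length - 1,
    steps = 0;
  while (lo <= hi) {
    steps++;
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] === target) return { found: true, steps };
    if (sortedKeys[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return { found: false, steps };
}

// One million sorted ids: found in at most ~20 comparisons, not 1,000,000.
const keys = Array.from({ length: 1_000_000 }, (_, i) => i);
const result = search(keys, 755_000);
console.log(result.found, result.steps <= 20); // true true
```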
When to Create Indexes
- Columns used in WHERE clauses (especially equality and range comparisons).
- Columns used in JOIN conditions (foreign keys).
- Columns used in ORDER BY (the index provides sorted order for free).
When NOT to Create Indexes
- Write-heavy tables with few reads. Every INSERT/UPDATE must also update the index.
- Low-cardinality columns (e.g., a `status` column with 3 possible values). The index doesn't help much when most rows match.
- Small tables. Scanning 100 rows is faster than an index lookup.
Composite Indexes
An index on (user_id, created_at) helps queries that filter on both columns, or on user_id alone (leftmost prefix). It does NOT help queries that only filter on created_at.
```sql
-- Uses the composite index
SELECT * FROM orders WHERE user_id = 42 AND created_at > '2025-01-01';
SELECT * FROM orders WHERE user_id = 42;

-- Does NOT use the composite index (missing leftmost column)
SELECT * FROM orders WHERE created_at > '2025-01-01';
```

Transactions and Isolation Levels
A transaction groups multiple operations into an atomic unit. Either all operations succeed, or none do.
```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```

If the server crashes between the two updates, the transaction is rolled back. No money disappears. This is the A in ACID (Atomicity, Consistency, Isolation, Durability).
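A toy sketch of atomicity, with hypothetical in-memory accounts: apply all changes to a working copy, and publish the copy only if every step succeeds. A real database achieves this with a write-ahead log, but the all-or-nothing contract is the same.

```javascript
let accounts = { 1: 500, 2: 300 };

// Run fn against a copy; commit the copy only if fn completes without throwing.
function transact(fn) {
  const working = { ...accounts };
  try {
    fn(working);
    accounts = working; // COMMIT: all changes become visible at once
    return true;
  } catch {
    return false; // ROLLBACK: the copy is discarded, accounts is untouched
  }
}

// Transfer money between accounts, refusing to overdraw.
const transfer = (from, to, amount) => (acc) => {
  if (acc[from] < amount) throw new Error("insufficient funds");
  acc[from] -= amount;
  acc[to] += amount;
};

console.log(transact(transfer(1, 2, 100))); // true: both updates applied
console.log(accounts[1], accounts[2]); // 400 400
console.log(transact(transfer(1, 2, 9999))); // false: neither update applied
console.log(accounts[1], accounts[2]); // still 400 400
```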
Isolation Levels
When multiple transactions run concurrently, isolation levels determine what they can see of each other's uncommitted changes.
| Level | Dirty reads | Non-repeatable reads | Phantom reads |
|---|---|---|---|
| READ UNCOMMITTED | Yes | Yes | Yes |
| READ COMMITTED | No | Yes | Yes |
| REPEATABLE READ | No | No | Yes |
| SERIALIZABLE | No | No | No |
- Dirty read: You see another transaction's uncommitted changes (which might be rolled back).
- Non-repeatable read: You read the same row twice and get different values because another transaction committed a change.
- Phantom read: You run the same query twice and get different rows because another transaction inserted/deleted rows.
PostgreSQL defaults to READ COMMITTED. Most applications are fine with this. Use SERIALIZABLE when you need absolute correctness (financial calculations) and can handle the performance cost (more lock contention, potential transaction retries).
Query Planning and EXPLAIN
The query planner is the database's optimizer. It takes your SQL and decides the most efficient execution strategy: which indexes to use, what join algorithm to pick, whether to sort.
```sql
EXPLAIN ANALYZE
SELECT u.name, count(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.created_at > '2025-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 10;
```

The output shows:
- Seq Scan vs Index Scan: Sequential scan reads every row. Index scan uses an index. If you see Seq Scan on a large table where you expected an index, you're missing an index or the planner decided it wasn't worth it.
- Nested Loop vs Hash Join vs Merge Join: How the database combines tables. Hash joins are typically fastest for large tables. Nested loops are fine for small results with indexed lookups.
- Actual time: Real execution time in milliseconds.
- Rows: Estimated vs actual row counts. Large discrepancies indicate stale statistics (run `ANALYZE` on your tables).
```
Sort  (cost=234.56..234.78 rows=10)
  ->  HashAggregate  (cost=200.00..210.00 rows=100)
        ->  Hash Join  (cost=50.00..180.00 rows=5000)
              Hash Cond: (o.user_id = u.id)
              ->  Index Scan using idx_orders_created on orders o
                    Index Cond: (created_at > '2025-01-01')
              ->  Hash  (cost=25.00..25.00 rows=1000)
                    ->  Seq Scan on users u
```
Reading query plans is like reading a profiler output. It tells you where time is spent and what the database chose to do. Learn to read them and you'll fix slow queries in minutes instead of hours.
NoSQL Databases
Not everything fits neatly into tables. NoSQL databases trade some relational guarantees for flexibility, scalability, or performance in specific access patterns.
Document Stores (MongoDB, CouchDB)
Store data as flexible JSON-like documents. No fixed schema—each document can have different fields.
```json
{
  "_id": "user_123",
  "name": "Alice",
  "email": "alice@example.com",
  "orders": [
    { "total": 59.99, "status": "shipped" },
    { "total": 124.0, "status": "delivered" }
  ]
}
```

Good for: Rapid prototyping, content management, catalogs, data that varies per record. Embedding related data (orders inside the user document) eliminates joins for common access patterns.
Tradeoff: No joins, limited transactions (historically), denormalization leads to data duplication.
Key-Value Stores (Redis, DynamoDB)
The simplest model: a key maps to a value. Extremely fast for lookups by key.
```
"session:abc123" → {"userId": 42, "expires": "2025-12-31"}
"cache:user:42"  → {"name": "Alice", "role": "admin"}
```
Good for: Caching, session storage, rate limiting, leaderboards, pub/sub messaging.
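A minimal sketch of the key-value-with-expiry pattern that session stores and caches rely on (in-memory JavaScript; Redis gives you this natively with `SET key value EX seconds`):

```javascript
// Key-value store with per-key TTL, like a tiny Redis.
class TTLCache {
  constructor() {
    this.store = new Map();
  }
  set(key, value, ttlMs) {
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
  get(key) {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (Date.now() >= entry.expiresAt) {
      this.store.delete(key); // lazy expiration on read
      return null;
    }
    return entry.value;
  }
}

const cache = new TTLCache();
cache.set("session:abc123", { userId: 42 }, 60_000);
console.log(cache.get("session:abc123")); // { userId: 42 }
cache.set("session:old", { userId: 7 }, -1); // already expired
console.log(cache.get("session:old")); // null
```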
Column-Oriented (Cassandra, ClickHouse)
Store data by column instead of by row. Reading all values of a single column (e.g., all created_at timestamps) is fast because they're stored contiguously. Aggregations over large datasets are efficient.
Good for: Analytics, time-series data, write-heavy workloads with simple read patterns.
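The row-vs-column layout difference can be sketched in memory with toy data: a columnar layout keeps every value of one attribute in a single contiguous array, so a column aggregate never touches the other columns.

```javascript
// Row-oriented: each record carries every column.
const rows = [
  { id: 1, total: 59.99, status: "shipped" },
  { id: 2, total: 124.0, status: "delivered" },
  { id: 3, total: 15.5, status: "pending" },
];

// Column-oriented: one array per column, aligned by position.
const columns = {
  id: [1, 2, 3],
  total: [59.99, 124.0, 15.5],
  status: ["shipped", "delivered", "pending"],
};

// Summing `total` in the columnar layout scans one contiguous array;
// the row layout must walk every full record to reach each total.
const sumColumnar = columns.total.reduce((a, b) => a + b, 0);
const sumRows = rows.reduce((a, r) => a + r.total, 0);
console.log(sumColumnar === sumRows); // true, same answer either way
console.log(sumColumnar.toFixed(2)); // "199.49"
```

On disk the win is larger than this sketch suggests: contiguous same-typed values compress well and stream through the CPU cache, which is why analytics engines favor this layout.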
When to Use What
| Need | Choose |
|---|---|
| Complex queries, joins, transactions | Relational (PostgreSQL) |
| Flexible schema, embedded documents | Document (MongoDB) |
| Fast key lookups, caching | Key-value (Redis) |
| Large-scale analytics | Column-oriented (ClickHouse) |
| Graph relationships | Graph (Neo4j) |
Most applications should start with a relational database. It's the most flexible, well-understood, and well-tooled option. Add specialized databases when you have a specific need that relational doesn't serve well.
CAP Theorem
The CAP theorem states that a distributed database can provide at most two of three guarantees:
- Consistency: Every read returns the most recent write.
- Availability: Every request gets a response (no timeouts).
- Partition tolerance: The system works despite network failures between nodes.
Since network partitions are unavoidable in distributed systems, the real choice is between consistency and availability when a partition occurs:
- CP systems (PostgreSQL with synchronous replication, MongoDB with majority write concern): During a partition, some requests may fail to maintain consistency.
- AP systems (Cassandra, DynamoDB): During a partition, all nodes respond but may return stale data.
In practice, CAP is a spectrum, not a binary choice. Most systems let you tune the consistency level per operation. Understanding the tradeoff helps you choose the right configuration for each use case: financial transactions need CP, analytics dashboards can tolerate AP.
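One common way systems expose this spectrum is quorum reads and writes: with N replicas, a write waits for W acknowledgements and a read consults R replicas. If R + W > N, every read set overlaps every write set, so a read always sees the latest committed write. A sketch with hypothetical in-memory replicas:

```javascript
// Three replicas, each holding a versioned value per key.
const replicas = [new Map(), new Map(), new Map()];
const N = 3, W = 2, R = 2; // R + W > N guarantees read/write overlap

let version = 0;
function write(key, value) {
  version++;
  // Wait for W acks: here, write to the first W replicas.
  for (let i = 0; i < W; i++) replicas[i].set(key, { value, version });
}

function read(key) {
  // Consult R replicas (the last R, to force worst-case overlap)
  // and take the newest version seen.
  const seen = replicas
    .slice(N - R)
    .map((r) => r.get(key))
    .filter(Boolean);
  seen.sort((a, b) => b.version - a.version);
  return seen.length ? seen[0].value : null;
}

write("balance", 100);
write("balance", 250);
console.log(read("balance")); // 250: the overlapping replica has the latest write
```

Dialing W or R down trades consistency for availability and latency, which is exactly the per-operation tuning described above.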
The Pragmatic Takeaway
Databases are not magic boxes. They're engineering systems with specific behaviors, tradeoffs, and failure modes. The same query can take 5ms or 5 seconds depending on whether an index exists. The same schema can be elegant or nightmarish depending on normalization decisions.
Learn to read EXPLAIN ANALYZE output. Understand why your ORM generates the SQL it does. Know the difference between a hash join and a nested loop. These aren't DBA skills—they're application developer skills that prevent the most common class of production performance problems.
Start with PostgreSQL, learn it deeply, and reach for specialized databases only when you have a need that PostgreSQL genuinely can't serve. Most of the time, a well-designed relational schema with proper indexes outperforms the NoSQL alternative for general application workloads.