Computer Architecture and Memory
I used to think hardware was irrelevant to a web developer. I write JavaScript—what do I care about CPU caches? Then I profiled a function that processed a large dataset and discovered that changing the data layout—same algorithm, same Big O—made it 4x faster. The reason: cache locality. The CPU was spending more time fetching data from memory than doing actual computation. That experience convinced me that understanding the hardware beneath your code isn't optional, even if you never write assembly.
This post covers the architecture concepts that leak up through every abstraction layer and affect the performance of code at every level.
The CPU
A CPU (Central Processing Unit) is the brain of the computer. At its core, it does three things: fetch an instruction, decode it, and execute it. It does this billions of times per second.
Cores and Clock Speed
Modern CPUs have multiple cores, each capable of executing instructions independently. A 4-core CPU can run 4 threads truly in parallel.
Clock speed (measured in GHz) determines how many cycles per second a core can execute. A 3 GHz core does 3 billion cycles per second. Simple operations (integer addition, comparison) take 1 cycle. More complex operations (division, memory access) take many cycles.
More cores vs faster cores: For single-threaded code (most JavaScript), a faster single core matters more. For parallel workloads (video encoding, compilation), more cores matter more. This is why Apple's M-series chips use a mix of performance cores (fast, power-hungry) and efficiency cores (slow, power-efficient).
Registers
Registers are the fastest storage in the CPU—tiny (a few hundred bytes total), on-chip memory that the CPU accesses in a single cycle. When the CPU adds two numbers, both numbers must be in registers. Everything else (RAM, disk) is accessed indirectly.
Modern x86-64 CPUs have ~16 general-purpose registers, each 64 bits wide. That's 128 bytes for the CPU's "working memory." Everything your program does eventually passes through these few registers.
The Memory Hierarchy
The fundamental tension in computer architecture is that fast memory is small and expensive, while large memory is slow and cheap. The solution is a hierarchy:
┌───────────────────────────────────────────────────┐
│ Registers  ~1 cycle             hundreds of bytes │
├───────────────────────────────────────────────────┤
│ L1 Cache   ~4 cycles            ~64 KB            │
├───────────────────────────────────────────────────┤
│ L2 Cache   ~12 cycles           ~256 KB-1 MB      │
├───────────────────────────────────────────────────┤
│ L3 Cache   ~40 cycles           ~8-64 MB          │
├───────────────────────────────────────────────────┤
│ Main RAM   ~200 cycles          ~8-128 GB         │
├───────────────────────────────────────────────────┤
│ SSD        ~100,000+ cycles     ~256 GB-4 TB      │
├───────────────────────────────────────────────────┤
│ HDD        ~10,000,000+ cycles  ~1-20 TB          │
└───────────────────────────────────────────────────┘
The numbers are approximate, but the ratios are real. Accessing L1 cache is ~50x faster than accessing RAM. Accessing RAM is ~1,000x faster than accessing an SSD. This means a cache miss (data not in cache, must go to RAM) is not a minor penalty—it's a 50x slowdown for that operation.
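One way to see why hit rates dominate performance is the textbook average memory access time formula: AMAT = hit time + miss rate × miss penalty. A sketch using the illustrative cycle counts from the table above (not measurements from any particular CPU):

```javascript
// Average memory access time: hit time plus the expected miss cost.
function amat(hitCycles, missRate, missPenaltyCycles) {
  return hitCycles + missRate * missPenaltyCycles;
}

// 95% L1 hit rate, misses going all the way to RAM (~200 cycles):
console.log(amat(4, 0.05, 200)); // ≈14 cycles per access on average
// Drop to a 70% hit rate and the average cost more than quadruples:
console.log(amat(4, 0.30, 200)); // ≈64 cycles per access on average
```

A 25-point drop in hit rate makes every memory access several times slower on average, even though nothing about the code's instruction count changed.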
CPU Caches
Caches are small, fast memory that sit between the CPU and RAM. They automatically store recently accessed data and data that's likely to be accessed next.
How Caches Work
When the CPU reads a memory address, it checks L1 first, then L2, then L3, then RAM. If the data is found in cache, it's a cache hit (fast). If not, it's a cache miss (slow—the CPU stalls while waiting for data from the next level).
Caches don't load individual bytes. They load cache lines—typically 64 bytes at a time. When you access array[0], the cache loads bytes 0-63. If array[1] is within those 64 bytes, it's already in cache. This is why sequential access patterns are fast and random access patterns are slow.
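To make the 64-byte granularity concrete: a Float64Array element is 8 bytes, so 8 elements share one cache line. This sketch counts how many distinct lines two access patterns touch; it is a simplified model (ignoring alignment and associativity), not a measurement:

```javascript
const LINE_BYTES = 64;
const ELEM_BYTES = 8;                        // Float64
const perLine = LINE_BYTES / ELEM_BYTES;     // 8 elements per cache line

// Count the distinct cache lines a sequence of element indices touches.
function linesTouched(indices) {
  const lines = new Set();
  for (const i of indices) {
    lines.add(Math.floor(i / perLine));      // which line index i falls in
  }
  return lines.size;
}

const n = 64;
const sequential = [...Array(n).keys()];            // 0, 1, 2, ..., 63
const strided = sequential.map((i) => i * perLine); // 0, 8, 16, ...: one per line

console.log(linesTouched(sequential)); // 64 accesses, only 8 lines loaded
console.log(linesTouched(strided));    // 64 accesses, 64 lines loaded
```

Same number of element accesses, 8x more cache lines pulled in. The strided pattern uses one value out of every 64-byte line it loads and discards the rest.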
Cache Locality
Spatial locality: If you access address X, you'll likely access addresses near X soon. Arrays have great spatial locality—elements are stored contiguously, so accessing one loads nearby elements into cache.
Temporal locality: If you access address X, you'll likely access it again soon. Loop variables, frequently called functions, and hot data all benefit from temporal locality.
// Good cache locality — sequential access through contiguous memory
const arr = new Float64Array(1_000_000)
let sum = 0
for (let i = 0; i < arr.length; i++) {
  sum += arr[i] // each access loads nearby elements into cache
}

// Poor cache locality — random access, cache lines wasted
// (shuffle: a Fisher-Yates shuffle helper, assumed defined elsewhere)
const indices = shuffle([...Array(1_000_000).keys()])
let sum2 = 0
for (const i of indices) {
  sum2 += arr[i] // each access likely misses cache
}

Both loops do the same work (sum all elements) with the same Big O (O(n)). But the sequential version can be 5-10x faster because of cache effects. The random-access version triggers a cache miss on nearly every access.
Data Layout Matters
The way you organize data in memory directly affects cache performance.
Array of Structs vs Struct of Arrays
Consider processing a list of particles, each with position (x, y) and velocity (vx, vy):
Array of Structs (AoS):
// Each particle is an object — scattered on the heap
const particles = [
  { x: 1, y: 2, vx: 0.1, vy: 0.2 },
  { x: 3, y: 4, vx: 0.3, vy: 0.4 },
  // ...
]
// Update positions
for (const p of particles) {
  p.x += p.vx
  p.y += p.vy
}

Struct of Arrays (SoA):
// Each property in its own contiguous array
const count = 1_000_000 // number of particles
const x = new Float64Array(count)
const y = new Float64Array(count)
const vx = new Float64Array(count)
const vy = new Float64Array(count)
// Update positions
for (let i = 0; i < count; i++) {
  x[i] += vx[i]
  y[i] += vy[i]
}

The SoA version is often significantly faster for bulk operations. When updating x-positions, only the x and vx arrays are accessed—they're contiguous in memory, so cache lines are fully utilized. In the AoS version, each object is a separate heap allocation, and loading one particle's x also loads its y, vx, vy into the cache line, wasting cache space if you only need x and vx.
This pattern shows up in game engines, physics simulations, and data processing. The ECS (Entity Component System) pattern popular in game development is essentially SoA applied to game objects.
The Instruction Pipeline
Modern CPUs don't execute one instruction at a time. They use a pipeline—multiple instructions in different stages of execution simultaneously.
Clock:   1   2   3   4   5   6
Inst 1:  F   D   E   W
Inst 2:      F   D   E   W
Inst 3:          F   D   E   W

F = Fetch, D = Decode, E = Execute, W = Write back
While instruction 1 is executing, instruction 2 is being decoded, and instruction 3 is being fetched. This means the CPU effectively completes one instruction per cycle, even though each instruction takes multiple cycles to complete.
Pipeline stalls occur when the next instruction depends on the result of the current one, or when a branch (if/else) makes the CPU uncertain about what to fetch next.
Branch Prediction
When the CPU encounters a conditional branch (if, switch, loop condition), it doesn't know which path to take until the condition is evaluated. Rather than stalling the pipeline, the CPU predicts which branch will be taken and speculatively executes that path. If the prediction is correct, execution continues seamlessly. If wrong, the pipeline must be flushed and restarted—a penalty of ~15-20 cycles.
Modern branch predictors are remarkably accurate (>95% for typical code), but unpredictable branches hurt:
// Predictable — branch predictor learns the pattern quickly
// (assumed: data is an array of numbers 0-255; shuffle is a
// Fisher-Yates helper defined elsewhere)
const sorted = data.toSorted((a, b) => a - b)
let sum = 0
for (const val of sorted) {
  if (val >= 128) sum += val // after a point, always true
}

// Unpredictable — branch predictor fails ~50% of the time
const shuffled = shuffle([...data])
let sum2 = 0
for (const val of shuffled) {
  if (val >= 128) sum2 += val // random true/false
}

On sorted data, the branch transitions from "always false" to "always true" exactly once. The predictor handles this easily. On shuffled data, the branch is essentially random, causing frequent mispredictions. The sorted version can be 2-5x faster despite identical algorithmic work.
Branchless programming avoids this by converting conditionals to arithmetic:
// Branchless equivalent: val >= 128 ? val : 0
// (the boolean coerces to 0 or 1 in the multiplication)
sum += val * (val >= 128)

This matters in performance-critical inner loops (game engines, signal processing, database engines). For application code, it's rarely worth the readability cost.
Instruction-Level Parallelism
Modern CPUs can execute multiple independent instructions simultaneously, even within a single core. This is called superscalar execution.
// These operations are independent — the CPU can execute them in parallel
const a = x + 1
const b = y * 2
const c = z - 3

// This chain is dependent — each step must wait for the previous
const d = x + 1
const e = d * 2 // depends on d
const f = e - 3 // depends on e

The compiler and CPU reorder and parallelize independent operations automatically. You rarely need to think about this, but it explains why breaking data dependencies (using separate accumulators in a reduction, for example) can speed up tight loops.
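The separate-accumulators trick mentioned above can be sketched like this. Whether it helps in JavaScript depends on the engine and JIT; in lower-level languages it is a well-known win for tight reduction loops:

```javascript
// Single accumulator: every addition depends on the previous one,
// forming a serial dependency chain.
function sumSingle(arr) {
  let s = 0;
  for (let i = 0; i < arr.length; i++) s += arr[i];
  return s;
}

// Four accumulators: the four additions per iteration are independent,
// so the CPU can overlap them; partials are combined at the end.
function sumUnrolled(arr) {
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  const n = arr.length - (arr.length % 4);
  for (let i = 0; i < n; i += 4) {
    s0 += arr[i];
    s1 += arr[i + 1];
    s2 += arr[i + 2];
    s3 += arr[i + 3];
  }
  let s = s0 + s1 + s2 + s3;
  for (let i = n; i < arr.length; i++) s += arr[i]; // leftover tail
  return s;
}

const data = new Float64Array(1000).map((_, i) => i);
console.log(sumSingle(data), sumUnrolled(data)); // both 499500
```

One caveat: with floating-point data the two versions can differ in the last bits, because addition order changes rounding. For integer-valued data like the example, the results are identical.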
SIMD — Single Instruction, Multiple Data
SIMD instructions operate on multiple values simultaneously. Instead of adding two numbers, a SIMD instruction adds four (or eight, or sixteen) pairs of numbers in a single cycle.
Normal:  a₁ + b₁ = c₁      (1 operation per cycle)

SIMD:    a₁ + b₁ = c₁
         a₂ + b₂ = c₂      (4 operations per cycle)
         a₃ + b₃ = c₃
         a₄ + b₄ = c₄
You won't write SIMD instructions in JavaScript, but you benefit from them indirectly. TypedArrays (Float64Array, Int32Array) give the engine a chance to use SIMD because the data is densely packed with known types. V8 and other engines apply SIMD optimizations to certain array operations when possible.
WebAssembly has explicit SIMD support, which is why performance-critical libraries (image processing, physics, codecs) compile to Wasm.
How This Affects Your Code
You don't need to write assembly or count cache lines. But these concepts explain real-world performance behaviors:
| Observation | Architecture explanation |
|---|---|
| Processing a TypedArray is faster than a regular Array | Contiguous memory, cache-friendly, SIMD-eligible |
| Sorted data processes faster in some algorithms | Branch prediction succeeds on sorted patterns |
| Object-heavy code has higher GC pressure | Many small heap allocations, poor locality |
| First request to an API is slow, subsequent are fast | Page cache (disk → RAM), CPU cache warmup |
| Column-oriented databases are fast for analytics | Each column is contiguous — cache-friendly scans |
| B-tree indexes use large nodes | Minimize disk reads by matching disk page size |
| Redis is fast | Data is entirely in RAM — no disk latency |
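The column-oriented database row in that table is the same locality story in miniature. With a row-major flat matrix, scanning a row walks memory sequentially, while scanning a column jumps by `cols` elements each step and touches a new cache line on nearly every access. A sketch with hypothetical dimensions:

```javascript
// A rows × cols "table" stored flat in row-major order.
const rows = 1000, cols = 1000;
const table = new Float64Array(rows * cols).map(() => Math.random());

// Row scan: consecutive indices, cache lines fully utilized.
function sumRow(r) {
  let s = 0;
  for (let c = 0; c < cols; c++) s += table[r * cols + c];
  return s;
}

// Column scan: stride of `cols` elements (8 KB here) per step,
// so almost every access lands on a fresh cache line.
function sumColumn(c) {
  let s = 0;
  for (let r = 0; r < rows; r++) s += table[r * cols + c];
  return s;
}
```

Column-oriented databases flip the storage layout so that the scan analytics actually perform (a whole column) is the contiguous one.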
The Pragmatic Takeaway
The memory hierarchy is the single most important architecture concept for software performance. The difference between L1 cache and RAM is 50x. The difference between RAM and SSD is 1,000x. Every abstraction you use—from JavaScript arrays to database indexes to HTTP caches—is designed to keep frequently accessed data as close to the CPU as possible.
When you choose a data structure, you're also choosing a memory layout. Arrays are contiguous and cache-friendly. Linked lists and object graphs are scattered and cache-hostile. TypedArrays tell the engine exactly what type to expect, enabling optimizations that generic arrays can't.
You don't need to optimize for cache lines in your daily work. But when something is unexpectedly slow—and the algorithm analysis says it should be fast—the answer is often hiding in the memory hierarchy. Understanding this layer gives you the intuition to ask the right questions and look in the right places.