Computer Architecture and Memory
I used to think hardware was irrelevant to a web developer. I write JavaScript—what do I care about CPU caches? Then I profiled a function that processed a large dataset and discovered that changing the data layout—same algorithm, same Big O—made it 4x faster. The reason: cache locality. The CPU was spending more time fetching data from memory than doing actual computation. That experience convinced me that understanding the hardware beneath your code isn't optional, even if you never write assembly.
This post covers the architecture concepts that leak up through every abstraction layer and affect the performance of code at every level.
The CPU
A CPU (Central Processing Unit) is the brain of the computer. At its core, it does three things: fetch an instruction, decode it, and execute it. It does this billions of times per second.
Cores and Clock Speed
Modern CPUs have multiple cores, each capable of executing instructions independently. A 4-core CPU can run 4 threads truly in parallel.
Clock speed (measured in GHz) determines how many cycles per second a core can execute. A 3 GHz core does 3 billion cycles per second. Simple operations (integer addition, comparison) take 1 cycle. More complex operations (division, memory access) take many cycles.
More cores vs faster cores: For single-threaded code (most JavaScript), a faster single core matters more. For parallel workloads (video encoding, compilation), more cores matter more. This is why Apple's M-series chips use a mix of performance cores (fast, power-hungry) and efficiency cores (slow, power-efficient).
Registers
Registers are the fastest storage in the CPU—tiny (a few hundred bytes total), on-chip memory that the CPU accesses in a single cycle. When the CPU adds two numbers, both numbers must be in registers. Everything else (RAM, disk) is accessed indirectly.
Modern x86-64 CPUs have ~16 general-purpose registers, each 64 bits wide. That's 128 bytes for the CPU's "working memory." Everything your program does eventually passes through these few registers.
The Memory Hierarchy
The fundamental tension in computer architecture is that fast memory is small and expensive, while large memory is slow and cheap. The solution is a hierarchy:
┌───────────────────────────────────────────────────┐
│ Registers  ~1 cycle             hundreds of bytes │
├───────────────────────────────────────────────────┤
│ L1 Cache   ~4 cycles            ~64 KB            │
├───────────────────────────────────────────────────┤
│ L2 Cache   ~12 cycles           ~256 KB-1 MB      │
├───────────────────────────────────────────────────┤
│ L3 Cache   ~40 cycles           ~8-64 MB          │
├───────────────────────────────────────────────────┤
│ Main RAM   ~200 cycles          ~8-128 GB         │
├───────────────────────────────────────────────────┤
│ SSD        ~100,000+ cycles     ~256 GB-4 TB      │
├───────────────────────────────────────────────────┤
│ HDD        ~10,000,000+ cycles  ~1-20 TB          │
└───────────────────────────────────────────────────┘
The numbers are approximate, but the ratios are real. Accessing L1 cache is ~50x faster than accessing RAM. Accessing RAM is ~1,000x faster than accessing an SSD. This means a cache miss (data not in cache, must go to RAM) is not a minor penalty—it's a 50x slowdown for that operation.
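One way to see why hit rates dominate performance is the textbook average memory access time formula: AMAT = hit time + miss rate × miss penalty. A sketch using the illustrative cycle counts from the table above (not measurements from any particular CPU):

```javascript
// Average memory access time: hit time plus the expected miss cost.
function amat(hitCycles, missRate, missPenaltyCycles) {
  return hitCycles + missRate * missPenaltyCycles;
}

// 95% L1 hit rate, misses going all the way to RAM (~200 cycles):
console.log(amat(4, 0.05, 200)); // ≈14 cycles per access on average
// Drop to a 70% hit rate and the average cost more than quadruples:
console.log(amat(4, 0.30, 200)); // ≈64 cycles per access on average
```

A 25-point drop in hit rate makes every memory access several times slower on average, even though nothing about the code's instruction count changed.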
CPU Caches
Caches are small, fast memory that sit between the CPU and RAM. They automatically store recently accessed data and data that's likely to be accessed next.
How Caches Work
When the CPU reads a memory address, it checks L1 first, then L2, then L3, then RAM. If the data is found in cache, it's a cache hit (fast). If not, it's a cache miss (slow—the CPU stalls while waiting for data from the next level).
Caches don't load individual bytes. They load cache lines—typically 64 bytes at a time. When you access array[0], the cache loads bytes 0-63. If array[1] is within those 64 bytes, it's already in cache. This is why sequential access patterns are fast and random access patterns are slow.
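To make the 64-byte granularity concrete: a Float64Array element is 8 bytes, so 8 elements share one cache line. This sketch counts how many distinct lines two access patterns touch; it is a simplified model (ignoring alignment and associativity), not a measurement:

```javascript
const LINE_BYTES = 64;
const ELEM_BYTES = 8;                        // Float64
const perLine = LINE_BYTES / ELEM_BYTES;     // 8 elements per cache line

// Count the distinct cache lines a sequence of element indices touches.
function linesTouched(indices) {
  const lines = new Set();
  for (const i of indices) {
    lines.add(Math.floor(i / perLine));      // which line index i falls in
  }
  return lines.size;
}

const n = 64;
const sequential = [...Array(n).keys()];            // 0, 1, 2, ..., 63
const strided = sequential.map((i) => i * perLine); // 0, 8, 16, ...: one per line

console.log(linesTouched(sequential)); // 64 accesses, only 8 lines loaded
console.log(linesTouched(strided));    // 64 accesses, 64 lines loaded
```

Same number of element accesses, 8x more cache lines pulled in. The strided pattern uses one value out of every 64-byte line it loads and discards the rest.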
Cache Locality
Spatial locality: If you access address X, you'll likely access addresses near X soon. Arrays have great spatial locality—elements are stored contiguously, so accessing one loads nearby elements into cache.
Temporal locality: If you access address X, you'll likely access it again soon. Loop variables, frequently called functions, and hot data all benefit from temporal locality.
// Good cache locality — sequential access through contiguous memory
const arr = new Float64Array(1_000_000)
let sum = 0
for (let i = 0; i < arr.length; i++) {
  sum += arr[i] // each access loads nearby elements into cache
}

// Poor cache locality — random access, cache lines wasted
// (shuffle: a Fisher-Yates shuffle helper, assumed defined elsewhere)
const indices = shuffle([...Array(1_000_000).keys()])
let sum2 = 0
for (const i of indices) {
  sum2 += arr[i] // each access likely misses cache
}

Both loops do the same work (sum all elements) with the same Big O (O(n)). But the sequential version can be 5-10x faster because of cache effects. The random-access version triggers a cache miss on nearly every access.
Data Layout Matters
The way you organize data in memory directly affects cache performance.
Array of Structs vs Struct of Arrays
Consider processing a list of particles, each with position (x, y) and velocity (vx, vy):
Array of Structs (AoS):
// Each particle is an object — scattered on the heap
const particles = [
  { x: 1, y: 2, vx: 0.1, vy: 0.2 },
  { x: 3, y: 4, vx: 0.3, vy: 0.4 },
  // ...
]
// Update positions
for (const p of particles) {
  p.x += p.vx
  p.y += p.vy
}

Struct of Arrays (SoA):
// Each property in its own contiguous array
const count = 1_000_000 // number of particles
const x = new Float64Array(count)
const y = new Float64Array(count)
const vx = new Float64Array(count)
const vy = new Float64Array(count)
// Update positions
for (let i = 0; i < count; i++) {
  x[i] += vx[i]
  y[i] += vy[i]
}

The SoA version is often significantly faster for bulk operations. When updating x-positions, only the x and vx arrays are accessed—they're contiguous in memory, so cache lines are fully utilized. In the AoS version, each object is a separate heap allocation, and loading one particle's x also loads its y, vx, vy into the cache line, wasting cache space if you only need x and vx.
This pattern shows up in game engines, physics simulations, and data processing. The ECS (Entity Component System) pattern popular in game development is essentially SoA applied to game objects.
The Instruction Pipeline
Modern CPUs don't execute one instruction at a time. They use a pipeline—multiple instructions in different stages of execution simultaneously.
Clock:   1   2   3   4   5   6
Inst 1:  F   D   E   W
Inst 2:      F   D   E   W
Inst 3:          F   D   E   W

F = Fetch, D = Decode, E = Execute, W = Write back
While instruction 1 is executing, instruction 2 is being decoded, and instruction 3 is being fetched. This means the CPU effectively completes one instruction per cycle, even though each instruction takes multiple cycles to complete.
Pipeline stalls occur when the next instruction depends on the result of the current one, or when a branch (if/else) makes the CPU uncertain about what to fetch next.
Branch Prediction
When the CPU encounters a conditional branch (if, switch, loop condition), it doesn't know which path to take until the condition is evaluated. Rather than stalling the pipeline, the CPU predicts which branch will be taken and speculatively executes that path. If the prediction is correct, execution continues seamlessly. If wrong, the pipeline must be flushed and restarted—a penalty of ~15-20 cycles.
Modern branch predictors are remarkably accurate (>95% for typical code), but unpredictable branches hurt:
// Predictable — branch predictor learns the pattern quickly
// (assumed: data is an array of numbers 0-255; shuffle is a
// Fisher-Yates helper defined elsewhere)
const sorted = data.toSorted((a, b) => a - b)
let sum = 0
for (const val of sorted) {
  if (val >= 128) sum += val // after a point, always true
}

// Unpredictable — branch predictor fails ~50% of the time
const shuffled = shuffle([...data])
let sum2 = 0
for (const val of shuffled) {
  if (val >= 128) sum2 += val // random true/false
}

On sorted data, the branch transitions from "always false" to "always true" exactly once. The predictor handles this easily. On shuffled data, the branch is essentially random, causing frequent mispredictions. The sorted version can be 2-5x faster despite identical algorithmic work.
Branchless programming avoids this by converting conditionals to arithmetic:
// Branchless equivalent: val >= 128 ? val : 0
// (the boolean coerces to 0 or 1 in the multiplication)
sum += val * (val >= 128)

This matters in performance-critical inner loops (game engines, signal processing, database engines). For application code, it's rarely worth the readability cost.
Instruction-Level Parallelism
Modern CPUs can execute multiple independent instructions simultaneously, even within a single core. This is called superscalar execution.
// These operations are independent — the CPU can execute them in parallel
const a = x + 1
const b = y * 2
const c = z - 3

// This chain is dependent — each step must wait for the previous
const d = x + 1
const e = d * 2 // depends on d
const f = e - 3 // depends on e

The compiler and CPU reorder and parallelize independent operations automatically. You rarely need to think about this, but it explains why breaking data dependencies (using separate accumulators in a reduction, for example) can speed up tight loops.
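The separate-accumulators trick mentioned above can be sketched like this. Whether it helps in JavaScript depends on the engine and JIT; in lower-level languages it is a well-known win for tight reduction loops:

```javascript
// Single accumulator: every addition depends on the previous one,
// forming a serial dependency chain.
function sumSingle(arr) {
  let s = 0;
  for (let i = 0; i < arr.length; i++) s += arr[i];
  return s;
}

// Four accumulators: the four additions per iteration are independent,
// so the CPU can overlap them; partials are combined at the end.
function sumUnrolled(arr) {
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  const n = arr.length - (arr.length % 4);
  for (let i = 0; i < n; i += 4) {
    s0 += arr[i];
    s1 += arr[i + 1];
    s2 += arr[i + 2];
    s3 += arr[i + 3];
  }
  let s = s0 + s1 + s2 + s3;
  for (let i = n; i < arr.length; i++) s += arr[i]; // leftover tail
  return s;
}

const data = new Float64Array(1000).map((_, i) => i);
console.log(sumSingle(data), sumUnrolled(data)); // both 499500
```

One caveat: with floating-point data the two versions can differ in the last bits, because addition order changes rounding. For integer-valued data like the example, the results are identical.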
SIMD — Single Instruction, Multiple Data
SIMD instructions operate on multiple values simultaneously. Instead of adding two numbers, a SIMD instruction adds four (or eight, or sixteen) pairs of numbers in a single cycle.
Normal:  a₁ + b₁ = c₁      (1 operation per cycle)

SIMD:    a₁ + b₁ = c₁
         a₂ + b₂ = c₂      (4 operations per cycle)
         a₃ + b₃ = c₃
         a₄ + b₄ = c₄
You won't write SIMD instructions in JavaScript, but you benefit from them indirectly. TypedArrays (Float64Array, Int32Array) give the engine a chance to use SIMD because the data is densely packed with known types. V8 and other engines apply SIMD optimizations to certain array operations when possible.
WebAssembly has explicit SIMD support, which is why performance-critical libraries (image processing, physics, codecs) compile to Wasm.
How This Affects Your Code
You don't need to write assembly or count cache lines. But these concepts explain real-world performance behaviors:
| Observation | Architecture explanation |
|---|---|
| Processing a TypedArray is faster than a regular Array | Contiguous memory, cache-friendly, SIMD-eligible |
| Sorted data processes faster in some algorithms | Branch prediction succeeds on sorted patterns |
| Object-heavy code has higher GC pressure | Many small heap allocations, poor locality |
| First request to an API is slow, subsequent are fast | Page cache (disk → RAM), CPU cache warmup |
| Column-oriented databases are fast for analytics | Each column is contiguous — cache-friendly scans |
| B-tree indexes use large nodes | Minimize disk reads by matching disk page size |
| Redis is fast | Data is entirely in RAM — no disk latency |
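The column-oriented database row in that table is the same locality story in miniature. With a row-major flat matrix, scanning a row walks memory sequentially, while scanning a column jumps by `cols` elements each step and touches a new cache line on nearly every access. A sketch with hypothetical dimensions:

```javascript
// A rows × cols "table" stored flat in row-major order.
const rows = 1000, cols = 1000;
const table = new Float64Array(rows * cols).map(() => Math.random());

// Row scan: consecutive indices, cache lines fully utilized.
function sumRow(r) {
  let s = 0;
  for (let c = 0; c < cols; c++) s += table[r * cols + c];
  return s;
}

// Column scan: stride of `cols` elements (8 KB here) per step,
// so almost every access lands on a fresh cache line.
function sumColumn(c) {
  let s = 0;
  for (let r = 0; r < rows; r++) s += table[r * cols + c];
  return s;
}
```

Column-oriented databases flip the storage layout so that the scan analytics actually perform (a whole column) is the contiguous one.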
The Pragmatic Takeaway
The memory hierarchy is the single most important architecture concept for software performance. The difference between L1 cache and RAM is 50x. The difference between RAM and SSD is 1,000x. Every abstraction you use—from JavaScript arrays to database indexes to HTTP caches—is designed to keep frequently accessed data as close to the CPU as possible.
When you choose a data structure, you're also choosing a memory layout. Arrays are contiguous and cache-friendly. Linked lists and object graphs are scattered and cache-hostile. TypedArrays tell the engine exactly what type to expect, enabling optimizations that generic arrays can't.
You don't need to optimize for cache lines in your daily work. But when something is unexpectedly slow—and the algorithm analysis says it should be fast—the answer is often hiding in the memory hierarchy. Understanding this layer gives you the intuition to ask the right questions and look in the right places.