Probability and Statistics for Engineers
I used to think statistics was for data scientists, not for someone building web apps. Then I had to evaluate whether a new checkout flow actually improved conversion, interpret a monitoring alert that flagged a 0.3% increase in error rate, and decide whether to roll back a deploy based on 45 minutes of data. Each time, I made the wrong call—or took too long to make the right one—because I didn't have the mental framework to reason about uncertainty.
Statistics isn't about formulas. It's about making decisions with incomplete information—which is what engineers do every day.
Probability Basics
Probability is a number between 0 and 1 that represents how likely an event is. 0 means impossible, 1 means certain, 0.5 means equally likely as not.
P(heads) = 0.5 — fair coin
P(six) = 1/6 ≈ 0.167 — fair die
P(rain tomorrow) = 0.3 — weather model's estimate
Basic Rules
Addition rule: For mutually exclusive events (can't both happen), the probability of either is the sum.
P(A or B) = P(A) + P(B) when A and B are mutually exclusive
P(1 or 6 on a die) = 1/6 + 1/6 = 1/3
Multiplication rule: For independent events (one doesn't affect the other), the probability of both is the product.
P(A and B) = P(A) × P(B) when A and B are independent
P(two heads in a row) = 0.5 × 0.5 = 0.25
Complement: The probability of something NOT happening.
P(not A) = 1 - P(A)
P(at least one failure in 100 requests at 1% error rate) = 1 - 0.99^100 ≈ 0.634
That last one is why a 99% per-request success rate is less reassuring than it sounds. Over 100 requests, there's a 63% chance at least one fails.
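The complement trick is worth internalizing, so here's a minimal sketch in Python (the function name is mine, not from any library):

```python
def p_at_least_one_failure(error_rate: float, n_requests: int) -> float:
    """Complement rule: P(at least one failure) = 1 - P(zero failures)."""
    return 1 - (1 - error_rate) ** n_requests

print(p_at_least_one_failure(0.01, 100))   # ≈ 0.634
print(p_at_least_one_failure(0.01, 1000))  # ≈ 0.99996 — near certainty
```

Computing the complement directly is also numerically safer than summing P(exactly 1 failure) + P(exactly 2) + … over every possible count.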
Random Variables and Distributions
A random variable assigns a number to each outcome of a random process. The distribution describes the probabilities of all possible values.
Expected Value and Variance
Expected value (mean): The average outcome if you repeated the experiment infinitely many times.
E[die roll] = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
Variance: How spread out the values are. Low variance = values cluster near the mean. High variance = values are scattered.
Var(X) = E[(X - μ)²]
Standard deviation is the square root of variance—it's in the same units as the data, which makes it more intuitive.
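Both definitions translate directly into code. A quick sketch for the fair die:

```python
# Expected value and variance of a fair six-sided die,
# computed straight from the definitions.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # each outcome equally likely

mean = sum(x * p for x in outcomes)                     # E[X]
variance = sum((x - mean) ** 2 * p for x in outcomes)   # E[(X - mu)^2]
std_dev = variance ** 0.5

print(mean)      # 3.5
print(variance)  # ≈ 2.917
print(std_dev)   # ≈ 1.708
```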
Key Distributions
Bernoulli: A single trial with two outcomes (success/failure). Probability of success = p.
API call succeeds: p = 0.99
API call fails: 1 - p = 0.01
Binomial: Number of successes in n independent Bernoulli trials.
Out of 1000 API calls with 1% error rate:
Expected failures = n × p = 1000 × 0.01 = 10
Standard deviation = √(n × p × (1-p)) ≈ 3.15
So you'd typically see 7-13 failures. If you see 25 failures, something is probably wrong.
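To check how surprising 25 failures really is, you can compute the exact binomial tail. A sketch using only the standard library:

```python
import math

# Out of n calls with per-call failure probability p,
# what range of failure counts is "normal"?
n, p = 1000, 0.01
mean = n * p
sd = math.sqrt(n * p * (1 - p))
print(mean, sd)  # 10.0  ≈ 3.146

# Exact probability of 25 or more failures under the binomial model
p_extreme = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k)
    for k in range(25, n + 1)
)
print(p_extreme)  # well under 0.1% — strong evidence something changed
```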
Poisson: Number of events in a fixed time period when events occur independently at a constant rate.
If a server gets 5 errors per hour on average:
P(0 errors in next hour) = e^(-5) × 5^0 / 0! ≈ 0.007
P(exactly 5 errors) ≈ 0.175
P(10 or more errors) ≈ 0.032
Poisson is the go-to model for counts of independent events in a fixed interval: server errors per hour, customer signups per day, exceptions per deployment.
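The Poisson probabilities above come from one small formula, sketched here with the standard library (the function name is mine):

```python
import math

def poisson_pmf(lam: float, k: int) -> float:
    """P(exactly k events) when events arrive at an average rate lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5  # average errors per hour
print(poisson_pmf(lam, 0))  # ≈ 0.007
print(poisson_pmf(lam, 5))  # ≈ 0.175

# P(10 or more) via the complement of the cumulative sum
p_ten_or_more = 1 - sum(poisson_pmf(lam, k) for k in range(10))
print(p_ten_or_more)  # ≈ 0.032
```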
Normal (Gaussian): The bell curve. The distribution of many natural phenomena and, by the Central Limit Theorem, the distribution of sample averages regardless of the original distribution.
68% of values fall within 1 standard deviation of the mean
95% within 2 standard deviations
99.7% within 3 standard deviations
This is why monitoring tools use "3-sigma" alerts: under normal variation, a metric lands more than 3 standard deviations from its average only 0.3% of the time.
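The 68-95-99.7 percentages aren't magic constants; you can recover them from the standard normal's error function (the helper name here is mine):

```python
import math

def prob_within_k_sigma(k: float) -> float:
    """P(|Z| < k) for a standard normal, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, prob_within_k_sigma(k))
# 1 → ≈ 0.683, 2 → ≈ 0.954, 3 → ≈ 0.997
```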
Bayes' Theorem
Bayes' theorem updates your belief about a hypothesis given new evidence.
P(H | E) = P(E | H) × P(H) / P(E)
- P(H | E): Probability of hypothesis H given evidence E (what you want to know).
- P(E | H): Probability of evidence E if hypothesis H is true.
- P(H): Prior probability of H (what you believed before the evidence).
- P(E): Total probability of seeing the evidence.
Practical example: spam filtering.
Suppose 20% of emails are spam. The word "winner" appears in 80% of spam emails and 5% of legitimate emails. An email contains "winner"—what's the probability it's spam?
P(spam | "winner") = P("winner" | spam) × P(spam) / P("winner")
= 0.80 × 0.20 / (0.80 × 0.20 + 0.05 × 0.80)
= 0.16 / 0.20
= 0.80
80% chance it's spam. Naive Bayes classifiers use this calculation across many features (words) to classify emails, and they work surprisingly well.
Why Bayes matters for engineers: When your monitoring alerts on an anomaly, the probability that something is actually wrong depends on the alert's false positive rate and how often things actually break. A test with a 5% false positive rate will cry wolf constantly if real incidents only happen 0.1% of the time.
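Both the spam example and the alert example are the same three-line calculation. A sketch (the 99% detection rate in the alert case is an assumed value, not from the text):

```python
def posterior(prior: float, p_evidence_given_h: float,
              p_evidence_given_not_h: float) -> float:
    """Bayes' theorem: P(H | E) from the prior and the two likelihoods."""
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# Spam: 20% base rate, "winner" in 80% of spam, 5% of legitimate email
print(posterior(0.20, 0.80, 0.05))  # 0.80

# Monitoring: 0.1% incident base rate, 5% false positive rate,
# assumed 99% chance a real incident triggers the alert
print(posterior(0.001, 0.99, 0.05))  # ≈ 0.019 — most alerts are false alarms
```

Even with a fairly good alert, a tiny base rate means most firings are noise. That's the base-rate effect in one number.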
Confidence Intervals
A confidence interval gives a range of plausible values for an unknown parameter, along with a confidence level.
"The average response time is 150ms with a 95% confidence interval of [142ms, 158ms]" means: if we repeated this measurement many times, 95% of the intervals we'd compute would contain the true average.
CI = sample mean ± z × (standard deviation / √n)
For 95% confidence: z ≈ 1.96
Why sample size matters: With 10 data points, your confidence interval is wide (uncertain). With 10,000 data points, it's narrow (precise). This is why A/B tests need sufficient sample sizes.
Assuming a sample standard deviation of 50ms:
n = 100: CI = 150 ± 1.96 × (50/√100) = [140.2, 159.8] — wide
n = 10000: CI = 150 ± 1.96 × (50/√10000) = [149.02, 150.98] — narrow
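The formula is a one-liner. A sketch reproducing the two intervals above (using the same assumed 50ms standard deviation):

```python
import math

def confidence_interval(mean: float, std_dev: float, n: int, z: float = 1.96):
    """z-based CI for the mean; z ≈ 1.96 gives 95% confidence for large n."""
    margin = z * std_dev / math.sqrt(n)
    return mean - margin, mean + margin

print(confidence_interval(150, 50, 100))    # (140.2, 159.8) — wide
print(confidence_interval(150, 50, 10000))  # (149.02, 150.98) — narrow
```

Note the √n in the denominator: to halve the interval width you need four times the data.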
Hypothesis Testing and P-Values
Hypothesis testing is the formal framework for asking "is this difference real or just noise?"
The setup:
- Null hypothesis (H₀): There's no effect. The new feature doesn't change conversion.
- Alternative hypothesis (H₁): There is an effect. The new feature improves conversion.
- Collect data and compute a test statistic.
- Calculate the p-value: the probability of seeing a result this extreme if the null hypothesis were true.
- If p-value < significance level (typically 0.05), reject the null hypothesis.
What a p-value IS: The probability of seeing data at least this extreme, assuming nothing has changed.
What a p-value IS NOT: The probability that the null hypothesis is true. This is the most common misinterpretation, and it leads to bad decisions.
A p-value of 0.03 means: "If nothing actually changed, there's a 3% chance we'd see a difference this large by random chance alone." It does NOT mean "there's a 97% chance the change is real."
A/B Testing
A/B testing applies hypothesis testing to product decisions. Split users into two groups, show each a different variant, measure the outcome, and determine if the difference is statistically significant.
Variant A (control): 1000 users, 50 conversions → 5.0% rate
Variant B (treatment): 1000 users, 65 conversions → 6.5% rate
Difference: +1.5 percentage points
Is this real?
Calculating Significance
Pooled proportion: p = (50 + 65) / 2000 = 0.0575
Standard error: SE = √(p(1-p)(1/n₁ + 1/n₂)) = √(0.0575 × 0.9425 × 0.002) ≈ 0.0104
Z-statistic: z = (0.065 - 0.050) / 0.0104 ≈ 1.44
P-value: ≈ 0.075 (one-tailed)
P-value = 0.075 > 0.05, so we can't reject the null hypothesis. The difference could be noise. We need more data.
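The hand calculation above fits in a short function. A sketch (the function name is mine; the one-tailed p-value comes from the normal tail via `math.erfc`):

```python
import math

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """One-tailed z-test for whether variant B's conversion rate beats A's."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail P(Z > z)
    return z, p_value

z, p = two_proportion_z_test(50, 1000, 65, 1000)
print(z, p)  # z ≈ 1.44, p ≈ 0.075
```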
Common A/B Testing Mistakes
Peeking: Checking results daily and stopping when p < 0.05. This dramatically inflates false positives because you're effectively running multiple tests. Use sequential testing methods (like always-valid p-values or group sequential designs) if you must peek.
Underpowered tests: Running a test with too few users. Power is the probability of detecting a real effect. A test with 50% power has a coin-flip chance of missing a real improvement. Aim for 80% power minimum.
Multiple comparisons: Testing 20 metrics and celebrating the one that's significant. If you test 20 independent things at α = 0.05, you expect 1 false positive. Apply Bonferroni correction or use a pre-registered primary metric.
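The multiple-comparisons inflation is easy to quantify with the complement rule from earlier:

```python
# Chance of at least one false positive across m independent tests at alpha
m, alpha = 20, 0.05
family_wise = 1 - (1 - alpha) ** m
print(family_wise)  # ≈ 0.64 — "significance" is more likely than not

# Bonferroni correction: test each metric at alpha / m instead
print(alpha / m)  # 0.0025
```

With 20 metrics, finding at least one "significant" result by chance alone is more likely than not, which is why a pre-registered primary metric matters.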
Simpson's paradox: An effect that appears in aggregated data disappears or reverses when you split by segments. Always check results by key segments (device type, geography, user tenure).
Ignoring practical significance: A statistically significant 0.01% conversion improvement might be real but not worth the engineering cost of maintaining two code paths.
Monitoring and Alerting
Statistics directly informs monitoring. An alert should fire when a metric is anomalous, not just when it crosses a fixed threshold.
Static thresholds (error rate > 5%) are simple but fragile. They don't account for normal variation, time-of-day patterns, or seasonal trends.
Statistical alerts are better:
- Z-score alerts: Fire when a metric is more than k standard deviations from its rolling average. Adapts to the metric's natural variability.
- Percentile alerts: Fire when p99 latency exceeds a threshold. More robust than average-based alerts because outliers don't skew the calculation.
- Rate-of-change alerts: Fire when a metric's rate of change exceeds normal bounds. Catches sudden spikes even if the absolute value is within range.
Normal error rate: mean = 0.5%, stddev = 0.15%
Current error rate: 1.2%
Z-score: (1.2 - 0.5) / 0.15 = 4.67
That's a 4.67-sigma event — extremely unlikely to be noise. Alert!
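A minimal z-score alert is just the formula above wrapped in a function (names are mine; a production version would maintain the rolling mean and stddev from a window of recent samples):

```python
def z_score_alert(current: float, rolling_mean: float,
                  rolling_std: float, k: float = 3.0) -> bool:
    """Fire when the metric sits more than k standard deviations
    from its rolling average, in either direction."""
    z = abs(current - rolling_mean) / rolling_std
    return z > k

# Error rate example from the text: mean 0.5%, stddev 0.15%
print(z_score_alert(1.2, 0.5, 0.15))  # True — a 4.67-sigma event
print(z_score_alert(0.7, 0.5, 0.15))  # False — only 1.33 sigma, normal variation
```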
The Pragmatic Takeaway
You don't need to derive formulas. You need to build intuition for three things:
- Is this difference real or noise? This is hypothesis testing. Before celebrating a conversion improvement or panicking about an error spike, ask: how likely is this result if nothing actually changed? If you don't have enough data to answer confidently, wait.
- How much data do I need? Sample size determines precision. Small samples give wide confidence intervals and unreliable conclusions. Use power calculators before starting A/B tests.
- Is my mental model calibrated? Humans are terrible at intuiting probability. We underestimate the likelihood of coincidences, overreact to small samples, and see patterns in noise. Statistical thinking is a correction for these biases.
The engineers who make the best data-driven decisions aren't the ones who know the most formulas. They're the ones who know when to trust the numbers and when the numbers don't yet say enough. That judgment—knowing the boundary between signal and noise—is what statistics gives you.