Design of Experiments

A Practical Guide — From Theory to Automated Analysis

Preface

This book teaches the theory and practice of Design of Experiments (DOE) — a systematic method for planning experiments that extracts the maximum information from the minimum number of runs.

It is written for engineers, scientists, and anyone who runs experiments, benchmarks, or tests and wants to do so more efficiently.

DOE is not new. Ronald Fisher developed the foundations in the 1920s and 1930s for agricultural experiments at Rothamsted. George Box, Genichi Taguchi, and others extended it to manufacturing, quality engineering, and process optimization. Yet the technique remains underused, especially in software engineering and technology, where practitioners often default to changing one variable at a time or running exhaustive grid searches.

This guide takes a practical approach. Each chapter introduces the theory you need, then shows you how to apply it. Mathematical derivations are included where they build intuition, but the emphasis is on understanding when and why to use each technique, not on proofs.

Who This Book Is For

Engineers optimizing systems, processes, or software configurations. Scientists planning laboratory or field experiments. Data scientists evaluating model hyperparameters. Quality engineers improving manufacturing processes. Anyone who has ever wondered: "Which of these 10 knobs actually matters?"

Prerequisites: Basic familiarity with mean, variance, and standard deviation. No prior statistics coursework is assumed.

How This Book Is Organized

Part I (Chapters 1–2) lays the conceptual and statistical groundwork. You'll learn why planned experiments outperform ad hoc testing and acquire the vocabulary needed for the rest of the book.

Part II (Chapters 3–6) covers the major design families — full factorials, fractional factorials, screening designs, and response surface designs. Each chapter explains when to use the design, how the mathematics works, and walks through a realistic example.

Part III (Chapters 7–9) focuses on analysis: interpreting Pareto charts, fitting response surface models, diagnosing problems, and making data-driven decisions.

The User Guide at the end is a complete software reference for the DOE Helper Tool, including configuration fields, CLI commands, and step-by-step tutorials.

📖 Reading Order

If you're new to DOE, read Chapters 1–3 first, then jump to Chapter 7 to learn analysis. Come back to Chapters 4–6 when you need designs beyond the basic factorial. If you already know DOE and just want to use the tool, skip straight to the User Guide.


Chapter 1

Why Design of Experiments?

The Cost of Unplanned Experiments

Every experiment costs something — time, money, materials, compute cycles. A poorly planned experiment wastes these resources and may lead to wrong conclusions. A well-designed experiment extracts the maximum learning from each run.

🍰 The Cake Analogy

Imagine you're baking a cake and want to make it taste better. You could try changing the sugar amount, then the oven temperature, then the baking time — one thing at a time. After dozens of attempts, you might find a decent recipe. But you would never discover that a specific combination of less sugar with higher temperature produces the best result, because you never tested them together. DOE solves exactly this problem.

Consider a team tuning a database. They have five configuration knobs they suspect affect performance: buffer pool size, I/O capacity, connection limit, query cache, and thread pool size. The one-variable-at-a-time (OVAT) approach would be:

  1. Fix all knobs at their defaults.
  2. Change buffer pool size. Measure.
  3. Reset buffer pool. Change I/O capacity. Measure.
  4. Continue for each knob.

This requires at least 10 runs (2 per knob) and only measures each knob's effect in isolation, at the default settings of every other knob. It tells you nothing about how the knobs interact.

A Plackett-Burman screening design tests all five knobs in just 8 runs and estimates every main effect simultaneously. A full factorial on the two most important knobs then takes 4 more runs and reveals their interaction. Twelve runs total, and you know more than OVAT would teach you in fifty.

The Three Principles of DOE

Fisher established three principles that remain the foundation of every well-designed experiment:

1. Randomization

Randomly assign the order in which experimental runs are performed to protect against unknown lurking variables. If you run all low-temperature experiments in the morning and all high-temperature experiments in the afternoon, any morning-to-afternoon drift becomes confounded with your temperature factor.

Randomization does not eliminate lurking variables — it distributes their effects equally across all treatments, on average. This ensures that the statistical tests remain valid.

2. Replication

Repeat experimental runs to estimate experimental error and increase the precision of effect estimates. A single run at each combination tells you the response, but not how much it varies.

Replication vs. Repetition

Replication means independently setting up and running the experiment again. Repetition means measuring the same setup multiple times. Repetition estimates measurement error, but replication estimates the full experimental error. DOE requires replication.

3. Blocking

Group experimental runs into blocks of similar conditions to remove known sources of variation from the experimental error. If you must run experiments over two days, make each day a block.

The OVAT Trap: A Quantitative Example

Suppose the true relationship between yield and two factors (temperature T and pressure P) is:

Y = 50 + 10T + 5P + 8TP + ε

where T, P ∈ {-1, +1} (coded levels) and ε ~ N(0, 1).

✗ OVAT Approach

Run at T=-1, P=-1: Y = 43

Change T to +1 (P still -1): Y = 47

Observed change: 4, so the estimated coefficient is 4/2 = 2.

Wrong! The true coefficient is 10.

✓ Factorial Approach

All 4 combinations tested: 43, 47, 37, 73.

ME_T = (47+73)/2 - (43+37)/2 = 20

Coefficient = 20/2 = 10

Correct! It also recovers the interaction coefficient of 8.

The OVAT estimate is biased because it was measured at P=-1, where the interaction TP reduces temperature's apparent effect. The factorial design correctly estimates both effects and their interaction, using the same total data.
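The comparison above can be replayed in a few lines. This is a plain-Python sketch (not the DOE Helper Tool) of the model Y = 50 + 10T + 5P + 8TP; the noise term is omitted so the bias is easy to see.

```python
def y(t, p):
    # True model from the example, without the noise term.
    return 50 + 10 * t + 5 * p + 8 * t * p

# OVAT: vary T while P stays at its low level (-1).
ovat_change = y(+1, -1) - y(-1, -1)   # observed change in Y
ovat_coef = ovat_change / 2           # per-unit effect in coded units

# Factorial: average the T effect over both levels of P.
me_t = (y(+1, -1) + y(+1, +1)) / 2 - (y(-1, -1) + y(-1, +1)) / 2
factorial_coef = me_t / 2

print(ovat_coef)       # 2.0, biased low by the TP interaction
print(factorial_coef)  # 10.0, the true coefficient
```

The OVAT estimate lands at 2 because the interaction term (8TP) cancels most of the temperature effect at P = -1; the factorial estimate averages over both levels of P and recovers the true value.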

Hidden Replication in Factorials

In a 2ᵏ factorial, every effect estimate uses all 2ᵏ observations. Each observation plays a role in estimating every effect. This is why factorials are so efficient — there is no wasted data.

A Concrete Example: Web Server Tuning

Suppose you are optimizing a web server. Two factors matter: thread count (10 or 100) and cache size (64 MB or 512 MB). The true performance:

| Threads | Cache | Throughput (req/sec) |
|---|---|---|
| 10 | 64 MB | 500 |
| 100 | 64 MB | 600 |
| 10 | 512 MB | 700 |
| 100 | 512 MB | 1,400 |

OVAT Result

Starting at (10 threads, 64 MB):

Threads 10 → 100: 500 → 600 (+100)

Cache 64 MB → 512 MB: 500 → 700 (+200)

Recommendation: use the 512 MB cache alone.

Predicted: 700 req/sec

DOE Result

All 4 combinations tested:

Thread effect: +400

Cache effect: +500

Interaction: +300 (the thread effect is 600 req/sec larger with the large cache)

Recommendation: both high.

Actual: 1,400 req/sec

OVAT would have left 700 req/sec on the table — a 2x performance improvement, missed entirely because OVAT can't detect interactions.

Effect Sparsity: Why DOE Works

A fundamental observation underpins DOE's practicality: in most real systems, only a small fraction of factors truly matter. This is called the effect sparsity principle (sometimes the Pareto principle of effects). Out of 20 possible factors, typically 3–5 drive 80% or more of the variation.

This means you can afford a two-stage strategy: screen many factors cheaply to find the vital few, then investigate those few in detail. The total cost is far less than testing every factor exhaustively.

Effect Hierarchy

A related principle: main effects tend to be larger than two-factor interactions, which tend to be larger than three-factor interactions, and so on. This hierarchy is why fractional factorial designs work — they sacrifice higher-order interaction estimates that are usually negligible anyway.

When NOT to Use DOE

DOE is not always the right tool. Consider alternatives when:

  • You have only one factor. A simple A/B test or one-way comparison is sufficient.
  • The response is not measurable. DOE requires a quantifiable outcome. Subjective preferences need special treatment (conjoint analysis, paired comparisons).
  • The system is completely unknown. If you have no idea which factors might matter, start with domain-expert brainstorming and a few exploratory runs before committing to a formal design.
  • Runs are effectively free. If you can test millions of combinations cheaply (e.g., a fast simulation), a random search or grid search may be simpler.

In most practical engineering situations, however, experiments have a non-trivial cost, and DOE will save you time and resources.


Chapter 2

Statistical Foundations

The Linear Model

DOE assumes the response can be modeled as a linear combination of factor effects:

Y = β₀ + β₁X₁ + β₂X₂ + β₁₂X₁X₂ + ε

where β₀ is the grand mean, βᵢ are main effect coefficients, βᵢⱼ are interaction coefficients, and ε is random error.

This model is "linear" in the parameters (β), not necessarily in the factors. Even a quadratic term like Xᵢ² is linear in its coefficient βᵢᵢ.

The Full Model for Three Factors

For 3 factors (A, B, C) with interactions and quadratic terms, the complete model is:

Y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₁₂x₁x₂ + β₁₃x₁x₃ + β₂₃x₂x₃ + β₁₂₃x₁x₂x₃ + ε

That's 8 terms (including the intercept) for 3 factors — exactly the number of runs in a 2³ factorial. This isn't a coincidence: each run provides one equation, and the design is constructed so that the 8 equations are independent (orthogonal). When we add quadratic terms β₁₁x₁², β₂₂x₂², β₃₃x₃², the model grows to 11 terms, requiring more than 8 runs — this is why response surface designs (CCD, Box-Behnken) need more runs than factorials.

Linear model (factorial)

  • 1 + k + C(k,2) + ... terms
  • For k=3: 8 terms, 8 runs
  • Captures main effects + interactions

Quadratic model (RSM)

  • Linear terms + k quadratic terms
  • For k=3: 10–11 terms, 15–20 runs
  • Also captures curvature

Coded Variables

We convert natural factor values to coded values (-1, +1) using:

x = (X - X̅) / (range / 2), where X̅ is the midpoint of the two levels and range = X_high - X_low.

This standardizes all factors to the same scale, making effect magnitudes directly comparable. A factor with levels 150°C and 200°C has coded values -1 and +1, where 175°C is zero.

📏 Why Code?

Imagine comparing the "effect" of temperature (150 vs 200) to pressure (2 vs 6). In natural units, the numbers aren't comparable. In coded units, both range from -1 to +1, so an effect of 10 in temperature is directly comparable to an effect of 5 in pressure.

Worked Example: Coding Factor Values

Consider a temperature factor with levels 150°C (low) and 200°C (high):

| Natural value | Calculation | Coded value |
|---|---|---|
| 150°C | (150 - 175) / 25 | -1 |
| 175°C | (175 - 175) / 25 | 0 |
| 200°C | (200 - 175) / 25 | +1 |
| 162.5°C | (162.5 - 175) / 25 | -0.5 |
| 212.5°C | (212.5 - 175) / 25 | +1.5 (star point) |

The center (X̅) is (150 + 200) / 2 = 175, and the half-range is (200 - 150) / 2 = 25. Star points in CCD designs produce coded values beyond ±1, which is how they sample outside the original range.
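The coding arithmetic is easy to script. Below is a hypothetical helper (not part of the DOE Helper Tool) implementing the formula from this section:

```python
def code(natural, low, high):
    """Convert a natural factor value to coded units in [-1, +1]."""
    center = (low + high) / 2       # the X-bar midpoint
    half_range = (high - low) / 2
    return (natural - center) / half_range

print(code(150.0, 150.0, 200.0))   # -1.0
print(code(175.0, 150.0, 200.0))   #  0.0
print(code(212.5, 150.0, 200.0))   #  1.5, a CCD star point
```

Values inside the factor range always code to [-1, +1]; star points fall outside that interval by construction.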

Why Coding Matters for the DOE Helper Tool

The RSM fitting in doe optimize automatically codes all continuous factors to [-1, +1] before fitting. This means the reported coefficients are directly comparable in magnitude — a coefficient of 10 for temperature is twice as important as a coefficient of 5 for pressure, regardless of their natural units. Categorical factors are coded as -1 (first level) and +1 (last level).

Estimating Effects

Main Effects

The main effect of factor A is the difference between the average response at the high level and the average response at the low level:

ME_A = Y̅(A+) - Y̅(A-)

Interaction Effects

The interaction effect of factors A and B measures whether the effect of A depends on the level of B:

IE_AB = ½[(Y̅(A+,B+) - Y̅(A-,B+)) - (Y̅(A+,B-) - Y̅(A-,B-))]

If this is zero, A and B act independently. If large, their combined effect is synergistic (positive) or antagonistic (negative).

Percentage Contribution

To compare the relative importance of effects, we compute:

% Contribution_i = |Effect_i| / Σ_j |Effect_j| × 100

This is what the Pareto chart displays — a ranked bar chart of contributions with a cumulative line.
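As a sketch, here are the three formulas applied to the two-factor example from Chapter 1 (responses 43, 47, 37, 73). Note that these effects are twice the model coefficients (10, 5, 8), since the factors move two coded units from -1 to +1:

```python
def avg_diff(signs, ys):
    """Mean response where the column is +1, minus mean where it is -1."""
    hi = [y for s, y in zip(signs, ys) if s > 0]
    lo = [y for s, y in zip(signs, ys) if s < 0]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

A = [-1, +1, -1, +1]                 # temperature column
B = [-1, -1, +1, +1]                 # pressure column
AB = [a * b for a, b in zip(A, B)]   # interaction column
Y = [43, 47, 37, 73]

effects = {"A": avg_diff(A, Y), "B": avg_diff(B, Y), "AB": avg_diff(AB, Y)}
total = sum(abs(e) for e in effects.values())
contrib = {k: 100 * abs(e) / total for k, e in effects.items()}
print(effects)   # {'A': 20.0, 'B': 10.0, 'AB': 16.0}
```

Applying the main-effect formula to the AB product column yields exactly the interaction effect, because the design is balanced; this is the trick the Pareto analysis relies on.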

📊 Reading a Pareto Chart

The DOE Helper Tool generates Pareto charts automatically. The left axis shows effect magnitude as bars (tallest first). The right axis shows a cumulative percentage line. Factors to the left of where the cumulative line reaches ~80% are the "vital few" — the factors that drive most of the variation. Factors to the right are the "trivial many" that can often be ignored or fixed at convenient levels.

Confidence Intervals

A 95% confidence interval on a main effect is:

Effect ± t(α/2, df) × SE(effect)

If the interval includes zero, the effect is not statistically significant at the 5% level. The DOE Helper Tool computes these automatically using the t-distribution.
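A minimal sketch of the interval calculation. The standard error here is an assumed illustrative value, and the critical value t(0.025, 16) ≈ 2.120 is hard-coded from a t-table; a real analysis would look it up (e.g. with scipy.stats.t.ppf):

```python
def conf_interval(effect, se, t_crit):
    # Assumed inputs: effect estimate, its standard error, and a
    # t critical value taken from a table for the error df.
    half_width = t_crit * se
    return effect - half_width, effect + half_width

lo, hi = conf_interval(effect=12.85, se=1.5, t_crit=2.120)
significant = not (lo <= 0.0 <= hi)   # interval excludes zero?
print(round(lo, 2), round(hi, 2), significant)   # 9.67 16.03 True
```

Because the interval (9.67, 16.03) excludes zero, this effect would be declared significant at the 5% level.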

Practical vs. Statistical Significance

A statistically significant effect might be too small to matter in practice. Conversely, a practically important effect might not reach statistical significance with few observations. Always consider both.

What Affects the Width of Confidence Intervals?

The confidence interval width depends on three things:

| Factor | Effect on CI width | How to improve |
|---|---|---|
| Number of runs (N) | More runs → narrower CIs | Add replicates via block_count |
| Experimental noise (σ) | Higher noise → wider CIs | Improve measurement precision, control nuisance variables |
| Confidence level | 99% → wider than 95% | Accept 95% (standard practice) |

🔍 The Microscope Analogy

Think of confidence intervals as the resolution of a microscope. More replicates give you a higher-power lens — you can see smaller effects. With too few runs, only the biggest effects stand out above the noise floor. This is why the screening-to-optimization pipeline works: use PB to find the big effects, then use RSM with more runs per factor to precisely measure them.

Orthogonality and Balance

A design is orthogonal when the columns of the design matrix are uncorrelated. This means each effect estimate is independent of every other — changing one factor's range or removing one effect from the model doesn't change the other estimates.

A design is balanced when each factor level appears equally often. Balance ensures that effect estimates use all available data symmetrically.

⚖ The Scale Analogy

Think of balance like a weighing scale. If you test factor A at the high level 6 times but at the low level only 2 times, your estimate of A's effect is skewed — it's like putting heavier weights on one side of the scale. Balanced designs put equal weight on both sides.

Full factorial and Plackett-Burman designs are both orthogonal and balanced. This is what makes them statistically powerful — every run contributes useful information to every effect estimate.

Visualizing Orthogonality

In an orthogonal design, knowing the level of one factor tells you nothing about the level of any other factor. Here's the contrast:

Not orthogonal

When A is high, B is always high. The effects of A and B are confounded — you can't tell which one caused the change.

| Run | A | B |
|---|---|---|
| 1 | -1 | -1 |
| 2 | +1 | +1 |
| 3 | -1 | -1 |
| 4 | +1 | +1 |

Orthogonal

A and B vary independently. Each effect is estimated without contamination from the other.

| Run | A | B |
|---|---|---|
| 1 | -1 | -1 |
| 2 | +1 | -1 |
| 3 | -1 | +1 |
| 4 | +1 | +1 |
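Orthogonality can be checked numerically: two columns are orthogonal when their element-wise product sums to zero. A quick sketch for the two designs above:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Not orthogonal: B always copies A.
A1, B1 = [-1, +1, -1, +1], [-1, +1, -1, +1]
# Orthogonal: the 2x2 factorial columns.
A2, B2 = [-1, +1, -1, +1], [-1, -1, +1, +1]

print(dot(A1, B1))   # 4 -> perfectly correlated, effects confounded
print(dot(A2, B2))   # 0 -> orthogonal, effects separable
```

The same dot-product test, applied to every pair of columns, is how design matrices are verified in practice.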

Degrees of Freedom

Degrees of freedom (df) determine how much information your design provides. In a 2ᵏ factorial with N = 2ᵏ runs:

  • 1 df for the overall mean
  • k df for main effects
  • C(k,2) df for two-factor interactions
  • Remaining df available for error estimation (if you have replicates)

Without replicates, you must assume higher-order interactions are zero and "pool" their degrees of freedom to estimate error. This is why center-point replicates are valuable — they provide pure-error degrees of freedom without increasing the factorial portion of the design.

Worked Example: Degree of Freedom Budget

For a 2⁴ factorial (16 runs) with 2 replicates (32 total runs):

| Source | df | What it estimates |
|---|---|---|
| Mean | 1 | Overall average |
| Main effects (A, B, C, D) | 4 | Individual factor contributions |
| 2-factor interactions (AB, AC, ...) | 6 | Pairwise synergies/antagonisms |
| 3-factor interactions (ABC, ...) | 4 | Usually negligible (pooled to error) |
| 4-factor interaction (ABCD) | 1 | Usually negligible (pooled to error) |
| Pure error (from replication) | 16 | Experimental noise |
| Total | 32 | |

The 16 error degrees of freedom come from the replicate: each of the 16 factorial points has 2 observations, giving 16 × (2-1) = 16 df for pure error. Without replication (single replicate), you'd need to assume higher-order interactions are negligible and pool their df to estimate error.
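The budget above can be checked mechanically with binomial coefficients:

```python
from math import comb

k, replicates = 4, 2
total_runs = replicates * 2**k   # 32

df = {
    "mean": 1,
    "main effects": k,                       # A, B, C, D
    "2-factor interactions": comb(k, 2),     # AB, AC, ...
    "3-factor interactions": comb(k, 3),
    "4-factor interaction": comb(k, 4),
    "pure error": 2**k * (replicates - 1),   # one df per replicated point
}
print(sum(df.values()) == total_runs)   # True: the df budget is exhausted
```

Every run buys exactly one degree of freedom, and the table partitions them with nothing left over.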

The Unreplicated Factorial Dilemma

With no replicates, there's no independent error estimate. You must either (a) assume high-order interactions are zero and use their degrees of freedom for error, (b) use the normal probability plot to visually separate real effects from noise, or (c) add center-point replicates. The DOE Helper Tool supports option (a) automatically and option (c) via blocking.

The Normal Probability Plot

When you have no replicates (and therefore no independent error estimate), a normal probability plot of effects helps identify which effects are real. Negligible effects cluster along a straight line (they are just noise). Effects that deviate from the line are likely real.

Reading a Normal Probability Plot

Points on the line: Noise — these effects are indistinguishable from random error.
Points far from the line: Real effects — the further from the line, the more confident you can be.
This technique was developed by Cuthbert Daniel (1959) and remains one of the most useful diagnostic tools in DOE.

Statistical Power and Sample Size

Power is the probability that your experiment will detect a real effect of a given size. Low power means you might run an experiment and conclude "nothing matters" when, in fact, something does — you just didn't have enough data to see it.

| Power | Meaning | Runs needed (typical) |
|---|---|---|
| 50% | Coin flip — might miss real effects | Unreplicated factorial |
| 80% | Standard target — acceptable for most work | 2 replicates |
| 90% | High confidence — for critical decisions | 3–4 replicates |

The key insight: power depends on the signal-to-noise ratio. If your factor ranges produce large effects relative to experimental noise, even a small design has high power. If the effects are subtle, you need more replicates. This is why factor range selection (Chapter 1) is so important — wider ranges amplify the signal.

Practical Rule of Thumb

You can detect an effect that is at least 2–3 times the standard deviation of your experimental noise. If a previous study or pilot run suggests σ = 5, then you can reliably detect effects of 10–15 or larger. Effects smaller than that will need more replicates or reduced noise to be detected.


Chapter 3

Full Factorial Designs

The 2ᵏ factorial tests every combination of k factors at 2 levels each. It is the most complete design — and the foundation for all others.

The 2ᵏ Factorial

For k factors, each at 2 levels, we need 2ᵏ runs:

| Factors | Runs |
|---|---|
| 2 | 4 |
| 3 | 8 |
| 4 | 16 |
| 5 | 32 |
| 6 | 64 |
| 10 | 1,024 |

Growth is exponential. This is manageable for 2–5 factors, but becomes impractical beyond that — which motivates fractional and screening designs.

Understanding Exponential Growth

The run count grows faster than most people intuit. Here's a comparison of run count vs. information gained:

| Factors (k) | Runs (2ᵏ) | Main effects | 2-factor interactions | Total effects estimated |
|---|---|---|---|---|
| 2 | 4 | 2 | 1 | 3 |
| 3 | 8 | 3 | 3 | 7 |
| 4 | 16 | 4 | 6 | 15 |
| 5 | 32 | 5 | 10 | 31 |
| 6 | 64 | 6 | 15 | 63 |
| 7 | 128 | 7 | 21 | 127 |

At k=5, only 15 of the 31 estimable effects are main effects or two-factor interactions — the rest are higher-order interactions that are usually negligible. This is the fundamental insight that motivates fractional designs: you're paying for information you probably don't need.
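The table can be regenerated with a few lines of Python (a sketch, using the standard binomial-coefficient counts):

```python
from math import comb

def factorial_budget(k):
    """Runs, main effects, 2-factor interactions, total effects for 2^k."""
    runs = 2**k
    return runs, k, comb(k, 2), runs - 1

for k in range(2, 8):
    runs, main, two_fi, total = factorial_budget(k)
    share = 100 * (main + two_fi) / total
    print(f"k={k}: {runs:4d} runs, "
          f"{main + two_fi:3d} of {total:3d} effects low-order ({share:.0f}%)")
```

As k grows, the low-order share collapses: at k=7 only 28 of the 127 estimable effects are main effects or two-factor interactions.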

💰 The Budget Analogy

Think of each experimental run as spending money. A full factorial spends evenly across all possible effects, including three-way, four-way, and higher interactions. That's like buying comprehensive insurance on a car, house, boat, and spaceship when you only own a car. Fractional designs spend the same budget more wisely by focusing on the effects most likely to matter.

The Design Matrix

A 2³ full factorial design matrix:

| Run | A | B | C |
|---|---|---|---|
| 1 | -1 | -1 | -1 |
| 2 | +1 | -1 | -1 |
| 3 | -1 | +1 | -1 |
| 4 | +1 | +1 | -1 |
| 5 | -1 | -1 | +1 |
| 6 | +1 | -1 | +1 |
| 7 | -1 | +1 | +1 |
| 8 | +1 | +1 | +1 |

Every column is balanced: equal numbers of -1 and +1. Every pair of columns is orthogonal. This ensures unbiased, independent estimates of all effects.

How to Read a Design Matrix

Each row is one experiment. Each column is one factor. The values -1 and +1 represent the low and high levels of that factor. To compute any interaction column, multiply the corresponding factor columns element-by-element:

| Run | A | B | AB | C | AC | BC | ABC |
|---|---|---|---|---|---|---|---|
| 1 | -1 | -1 | +1 | -1 | +1 | +1 | -1 |
| 2 | +1 | -1 | -1 | -1 | -1 | +1 | +1 |
| 3 | -1 | +1 | -1 | -1 | +1 | -1 | +1 |
| 4 | +1 | +1 | +1 | -1 | -1 | -1 | -1 |
| 5 | -1 | -1 | +1 | +1 | -1 | -1 | +1 |
| 6 | +1 | -1 | -1 | +1 | +1 | -1 | -1 |
| 7 | -1 | +1 | -1 | +1 | -1 | +1 | -1 |
| 8 | +1 | +1 | +1 | +1 | +1 | +1 | +1 |

Notice that every column (including interaction columns) is balanced (four +1s and four -1s) and orthogonal to every other column. This is the mathematical foundation that makes effect estimation unbiased. The DOE Helper Tool constructs and displays this matrix with doe info and doe generate --dry-run.
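The matrix can also be built and checked programmatically. This sketch uses plain Python rather than the tool:

```python
from itertools import product

# Standard order: A varies fastest, so generate (C, B, A) and reorder.
runs = [(a, b, c) for c, b, a in product([-1, +1], repeat=3)]

cols = {"A": [r[0] for r in runs],
        "B": [r[1] for r in runs],
        "C": [r[2] for r in runs]}
cols["ABC"] = [a * b * c for a, b, c in runs]

# Balanced: each column sums to zero (equal numbers of -1 and +1).
balanced = all(sum(col) == 0 for col in cols.values())
# Orthogonal: every pair of distinct columns has zero dot product.
orthogonal = all(sum(x * y for x, y in zip(cols[p], cols[q])) == 0
                 for p in cols for q in cols if p < q)
print(balanced, orthogonal)   # True True
```

The same two checks (column sums and pairwise dot products) apply to any candidate design matrix, not just the 2³.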

When to Use Full Factorials

Use Full Factorial When:

• You have 2–4 factors (manageable run count)
• You need to estimate all interactions, not just main effects
• Each run is cheap (fast simulations, automated tests)
• You want zero ambiguity about which effects are real

When to Avoid Full Factorials

• More than 5 factors (run count explodes: 64+ runs)
• Runs are expensive or time-consuming
• You're still in the screening phase and don't know which factors matter
• You need more than 2 levels per factor (consider general factorial or RSM instead)

Worked Example: Packaging Seal Strength

Three factors at 2 levels each: Seal Temperature (A), Pressure (B), Dwell Time (C). Response: seal strength in Newtons.

| Run | A | B | C | Strength (N) |
|---|---|---|---|---|
| 1 | -1 | -1 | -1 | 15.3 |
| 2 | +1 | -1 | -1 | 27.1 |
| 3 | -1 | +1 | -1 | 18.6 |
| 4 | +1 | +1 | -1 | 32.4 |
| 5 | -1 | -1 | +1 | 17.8 |
| 6 | +1 | -1 | +1 | 29.5 |
| 7 | -1 | +1 | +1 | 20.1 |
| 8 | +1 | +1 | +1 | 34.2 |

Main effect of A (Temperature):

ME_A = (27.1 + 32.4 + 29.5 + 34.2)/4 - (15.3 + 18.6 + 17.8 + 20.1)/4 = 30.80 - 17.95 = 12.85 N

Temperature is the dominant factor, increasing seal strength by ~12.85 N on average. The tool computes all effects, confidence intervals, and creates the Pareto chart automatically.

Complete Effect Analysis

Let's compute all effects from the seal strength data:

| Effect | Calculation | Value | % Contribution |
|---|---|---|---|
| A (Temperature) | (27.1+32.4+29.5+34.2)/4 - (15.3+18.6+17.8+20.1)/4 | 12.85 N | 63.1% |
| B (Pressure) | (18.6+32.4+20.1+34.2)/4 - (15.3+27.1+17.8+29.5)/4 | 3.90 N | 19.2% |
| C (Dwell Time) | (17.8+29.5+20.1+34.2)/4 - (15.3+27.1+18.6+32.4)/4 | 2.05 N | 10.1% |
| AB | ½[(33.30-19.35) - (28.30-16.55)] | 1.10 N | 5.4% |
| AC | ½[(31.85-18.95) - (29.75-16.95)] | 0.05 N | 0.2% |
| BC | ½[(27.15-23.65) - (25.50-21.20)] | -0.40 N | 2.0% |

(The interaction calculations use cell means — e.g., 33.30 is the average of the two runs at A+, B+ — and the percentages divide each absolute effect by the sum of all six.)

Interpretation: Temperature (A) dominates, accounting for roughly two-thirds of the total contribution. Pressure (B) and Dwell Time (C) supply most of the rest, and all interaction effects are small; the factors act nearly independently. The Pareto chart would show a steep drop after A, confirming that temperature is the primary lever for improving seal strength.
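Every number above can be recomputed from the raw data. Here is a plain-Python sketch (the DOE Helper Tool automates this via doe analyze):

```python
# Coded columns and responses from the seal strength table.
A = [-1, +1, -1, +1, -1, +1, -1, +1]
B = [-1, -1, +1, +1, -1, -1, +1, +1]
C = [-1, -1, -1, -1, +1, +1, +1, +1]
Y = [15.3, 27.1, 18.6, 32.4, 17.8, 29.5, 20.1, 34.2]

def effect(signs):
    """Contrast divided by half the runs: mean at +1 minus mean at -1."""
    return sum(s * y for s, y in zip(signs, Y)) / (len(Y) / 2)

cols = {"A": A, "B": B, "C": C,
        "AB": [a * b for a, b in zip(A, B)],
        "AC": [a * c for a, c in zip(A, C)],
        "BC": [b * c for b, c in zip(B, C)]}
effects = {name: effect(col) for name, col in cols.items()}
total = sum(abs(e) for e in effects.values())
for name, e in effects.items():
    print(f"{name:>2}: {e:+6.2f} N  ({100 * abs(e) / total:4.1f}%)")
```

Because the design is orthogonal, each effect is a simple signed average of the same eight observations; no regression solver is needed.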

The 80/20 Rule in Action

In this example, one factor (A) accounts for roughly two-thirds of the total effect. This is typical in practice — most processes are dominated by one or two key factors. The power of factorial designs is that they reveal this unambiguously, including any interactions, in a single set of experiments.

Adding Center Points

A 2ᵏ factorial only tests the extremes. What if the response curves between the extremes? Adding center points — runs at the midpoint of all factor ranges — lets you detect curvature without fitting a full quadratic model.

If the center-point average differs significantly from the average of the factorial points, curvature is present, and you should consider upgrading to a CCD or Box-Behnken design for the next phase.

Center Points Serve Two Purposes

1. Curvature detection: Compare the center-point mean to the factorial mean. A statistically significant difference signals nonlinearity.
2. Pure error estimation: Replicated center points provide degrees of freedom for error, even in unreplicated factorials. Three to five center replicates are typically sufficient.

Detecting Curvature with Center Points

Continuing the seal strength example: suppose we add 3 center-point replicates at Temperature=175°C, Pressure=4 bar, Dwell Time=7.5 sec. The results are: 28.1, 27.4, 28.8 N.

Center mean = (28.1 + 27.4 + 28.8) / 3 = 28.10 N
Factorial mean = (15.3 + 27.1 + 18.6 + 32.4 + 17.8 + 29.5 + 20.1 + 34.2) / 8 = 24.38 N

The center mean (28.10) is higher than the factorial mean (24.38), suggesting positive curvature — the response surface bows upward in the middle. The difference (3.72 N) is large enough to warrant investigating a quadratic model using CCD or Box-Behnken.
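The two means are simple averages, sketched below:

```python
center = [28.1, 27.4, 28.8]
factorial = [15.3, 27.1, 18.6, 32.4, 17.8, 29.5, 20.1, 34.2]

center_mean = sum(center) / len(center)           # 28.10
factorial_mean = sum(factorial) / len(factorial)  # 24.375, about 24.38
print(round(center_mean - factorial_mean, 2))     # curvature gap, about 3.72
```

A formal curvature test would compare this gap to its standard error (computed from the center-point replicates), but the size of the gap alone is often enough to justify a follow-up response surface design.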

Center Points Only Detect Curvature — They Don't Model It

Center points tell you whether curvature exists, but not which factor is curved. To fit individual quadratic terms (x₁², x₂², etc.), you need a response surface design with at least 3 distinct levels per factor.

Blocking a Factorial Design

Sometimes you can't complete all runs under identical conditions. Perhaps the raw material comes in batches, or experiments span multiple days. Blocking partitions the runs into groups (blocks) so that known sources of variation are accounted for.

In a 2ᵏ factorial, you assign the highest-order interaction to the block effect. For a 2³ design split into 2 blocks, assign ABC to blocks: runs where ABC = -1 go in Block 1, and runs where ABC = +1 go in Block 2. The ABC interaction becomes confounded with the block effect, but all main effects and two-factor interactions remain estimable.

Choose Your Block Generator Wisely

Always confound the highest-order interaction with blocks. This sacrifices the least valuable information. Never confound a main effect or a two-factor interaction you care about with the block variable.

Worked Example: Blocking a 23 Design

We split the 2³ seal strength design into 2 blocks. The generator is ABC (the three-factor interaction). Runs where A×B×C = -1 go in Block 1 (day 1); runs where A×B×C = +1 go in Block 2 (day 2):

Block 1 (A×B×C = -1)

| Run | A | B | C |
|---|---|---|---|
| 1 | -1 | -1 | -1 |
| 4 | +1 | +1 | -1 |
| 6 | +1 | -1 | +1 |
| 7 | -1 | +1 | +1 |

Block 2 (A×B×C = +1)

| Run | A | B | C |
|---|---|---|---|
| 2 | +1 | -1 | -1 |
| 3 | -1 | +1 | -1 |
| 5 | -1 | -1 | +1 |
| 8 | +1 | +1 | +1 |

Each block is balanced: within each block, every factor has an equal number of -1 and +1 values. This means all main effects (A, B, C) and two-factor interactions (AB, AC, BC) are estimable. Only the ABC interaction is confounded with the block effect — which is usually an acceptable trade-off.
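The split can be reproduced by computing the sign of A×B×C for each run in standard order (a sketch):

```python
from itertools import product

# Standard order: A varies fastest.
runs = [(a, b, c) for c, b, a in product([-1, +1], repeat=3)]

block1 = [i + 1 for i, (a, b, c) in enumerate(runs) if a * b * c == -1]
block2 = [i + 1 for i, (a, b, c) in enumerate(runs) if a * b * c == +1]
print(block1, block2)   # [1, 4, 6, 7] [2, 3, 5, 8]
```

Any generator column works the same way: the runs where the generator is -1 form one block, the rest form the other.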

Replicated Factorials

Running the entire factorial more than once gives you a direct estimate of experimental error and increases the precision of all effect estimates. With n replicates of a 2ᵏ design, the standard error of each effect is:

SE(effect) = 2s / √(n × 2ᵏ)

where s is the pooled standard deviation. Doubling the replicates reduces the standard error by a factor of √2 — you get about 41% more precision for twice the runs. This is diminishing returns, so most practitioners use 2–3 replicates rather than many.
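The formula and the √2 claim, sketched in code:

```python
from math import sqrt

def se_effect(s, n, k):
    """Standard error of an effect: 2s / sqrt(n * 2^k)."""
    return 2 * s / sqrt(n * 2**k)

single = se_effect(s=2.0, n=1, k=3)   # one replicate of a 2^3
double = se_effect(s=2.0, n=2, k=3)   # two replicates
print(single / double)                # 1.414..., i.e. sqrt(2)
```

Doubling n shrinks the standard error by exactly √2, regardless of s or k, which is why the payoff from additional replicates diminishes.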

Practical Guidance: Running a Full Factorial

Here's how a full factorial experiment typically proceeds with the DOE Helper Tool:

Full factorial workflow
```shell
# 1. Configure the experiment
$ cat config.json
{
  "factors": [
    {"name": "seal_temp", "levels": ["150", "200"], "type": "continuous", "unit": "°C"},
    {"name": "pressure", "levels": ["2", "6"], "type": "continuous", "unit": "bar"},
    {"name": "dwell_time", "levels": ["5", "10"], "type": "continuous", "unit": "sec"}
  ],
  "responses": [{"name": "seal_strength", "optimize": "maximize", "unit": "N"}],
  "settings": {"operation": "full_factorial", "block_count": 2}
}

# 2. Preview: 8 base runs × 2 blocks = 16 total runs
$ doe info --config config.json

# 3. Generate and run
$ doe generate --config config.json --seed 42
$ bash run_experiments.sh

# 4. Analyze
$ doe analyze --config config.json
$ doe optimize --config config.json
```

When to Stop

If the analysis reveals that one or two factors dominate (together accounting for 80%+ of the variation), and the interaction effects are small, you've found the vital few. Fix the unimportant factors at convenient levels and either declare victory or move to a response surface design to find the precise optimum of the important factors.


Chapter 4

Fractional Factorial Designs

When full factorials require too many runs, fractional factorials let you trade off complete interaction information for dramatic run-count reduction.

The Problem of Exponential Growth

Full factorial runs grow as 2ᵏ. For 7 factors, that's 128 runs. For 10, it's 1,024. This is often impractical.

But here's the saving insight: in most real systems, high-order interactions (three-factor and above) are negligible. The effect sparsity principle tells us that only a few effects — typically main effects and some two-factor interactions — drive the majority of the variation. A fractional factorial sacrifices information about high-order interactions (which are probably zero anyway) to dramatically reduce the run count.

✂ The Compression Analogy

A full factorial is like a raw, uncompressed image — it captures every pixel of information, including redundant detail you'll never notice. A fractional factorial is like JPEG compression: it discards the least important information (high-order interactions) while preserving the features that matter (main effects and key interactions). The "resolution" of the design tells you how aggressively you've compressed.

The 2ᵏ⁻ᵖ Design

A fractional factorial runs 2ᵏ⁻ᵖ experiments, where p is the number of "generators." A half-fraction (p=1) cuts runs by 50%. A quarter-fraction (p=2) cuts by 75%.

| Design | Runs | Full factorial |
|---|---|---|
| 2⁵⁻¹ | 16 | 32 |
| 2⁷⁻² | 32 | 128 |
| 2¹⁰⁻³ | 128 | 1,024 |

Resolution

Resolution describes the severity of the aliasing (confounding):

| Resolution | Aliasing | Interpretation |
|---|---|---|
| III | Main effects aliased with 2-factor interactions | Good for screening; not for estimating interactions |
| IV | Main effects clear; 2-factor interactions aliased with each other | Good for most practical work |
| V+ | Main effects and 2FIs clear | Nearly as good as full factorial |

The Trade-Off

Fractional factorials save runs but lose information. In a Resolution III design, if factor A appears important, it might actually be the AB interaction that matters. Follow-up experiments (foldover) can resolve these ambiguities.

Resolution in Practice

Here's what each resolution level means concretely:

III

Main effects aliased with 2FIs. Suitable for screening only. Cannot separate A from BC.

IV

Main effects clear. 2FIs aliased with each other. Most practical choice — you know which factors matter but may not know which pairs interact.

V+

Main effects and all 2FIs clear. Nearly as informative as a full factorial. Rarely needed for screening.

The Minimum Acceptable Resolution

For screening (finding which factors matter): Resolution III is sufficient.
For characterization (understanding interactions): Resolution IV is the minimum.
For definitive studies (estimating all 2FIs): Resolution V or higher.

Generators and Defining Relations

A fractional factorial is constructed by writing k-p factors as a full factorial, then generating the remaining p factors as products of existing columns. For example, in a 2⁵⁻¹ design, we might set E = ABCD. This is the generator.

The defining relation is I = ABCDE (multiply both sides by E). It tells you every alias pair. Any effect multiplied by the defining relation gives its alias:

A × ABCDE = BCDE,   so A is aliased with BCDE
AB × ABCDE = CDE,   so AB is aliased with CDE

In a Resolution V design like this, main effects are aliased only with four-factor interactions (negligible), and two-factor interactions are aliased with three-factor interactions (usually small). This is why Resolution V designs are nearly as good as full factorials.
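The word algebra above is easy to automate. Here is a minimal Python sketch (illustrative only, not the DOE Helper Tool's implementation): multiplying effect "words" is a symmetric difference of letter sets, because any squared letter cancels (A×A = I).

```python
# Sketch (not the DOE Helper Tool's code): alias computation for the
# 2^(5-1) design with defining relation I = ABCDE.

def multiply(word1, word2):
    result = set(word1) ^ set(word2)   # symmetric difference; A*A = I
    return "".join(sorted(result)) or "I"

DEFINING_WORD = "ABCDE"                # from the generator E = ABCD

def alias(effect):
    return multiply(effect, DEFINING_WORD)

print(alias("A"))    # BCDE: main effect aliased with a 4-factor interaction
print(alias("AB"))   # CDE: 2FI aliased with a 3-factor interaction
```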

Multiple Generators and Word Length

When p > 1 (quarter-fraction or smaller), you have multiple generators, and the defining relation contains multiple "words." For a 2^(7-2) design with generators F = ABCD and G = ABDE:

Defining relation: I = ABCDF = ABDEG = CEFG

The third word (CEFG) is the "generalized interaction" of the first two generators. The minimum word length in the defining relation determines the resolution. Here, the shortest word has 4 letters, so this is a Resolution IV design.
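The same letter-set arithmetic extends to multiple generators: every product of a nonempty subset of generator words is a word of the defining relation, and the shortest word's length gives the resolution. A sketch (again illustrative, not the tool's code):

```python
# Sketch: full defining relation and resolution for the 2^(7-2) design
# with generators F = ABCD and G = ABDE (i.e., words ABCDF and ABDEG).
from itertools import combinations

def multiply(*words):
    letters = set()
    for w in words:
        letters ^= set(w)              # squared letters cancel (A*A = I)
    return "".join(sorted(letters))

generators = ["ABCDF", "ABDEG"]

# Every product of a nonempty subset of generators is a word of I.
words = [multiply(*subset)
         for r in range(1, len(generators) + 1)
         for subset in combinations(generators, r)]

print(sorted(words, key=len))               # ['CEFG', 'ABCDF', 'ABDEG']
print("Resolution:", min(map(len, words)))  # Resolution: 4
```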

The DOE Helper Tool automatically computes and displays the alias structure when you use the info command with a fractional factorial design. This tells you exactly which effects are confounded with each other, so you can interpret your results correctly.

Minimum Aberration

Among all designs with the same resolution, the minimum aberration design has the fewest short words in its defining relation. This minimizes the number of low-order effects that are aliased together. The DOE Helper Tool uses minimum-aberration generators by default — the pyDOE3 library selects them automatically.

Worked Example: 2^(4-1) Half-Fraction

Four factors (A, B, C, D). A full factorial needs 16 runs. A half-fraction (p = 1) needs only 8. We set D = ABC as the generator, giving defining relation I = ABCD (Resolution IV).

Run | A  | B  | C  | D = ABC
1   | -1 | -1 | -1 | -1
2   | +1 | -1 | -1 | +1
3   | -1 | +1 | -1 | +1
4   | +1 | +1 | -1 | -1
5   | -1 | -1 | +1 | +1
6   | +1 | -1 | +1 | -1
7   | -1 | +1 | +1 | -1
8   | +1 | +1 | +1 | +1

The alias structure: A+BCD, B+ACD, C+ABD, D+ABC, AB+CD, AC+BD, AD+BC. Main effects are clear of two-factor interactions (Resolution IV), but two-factor interactions are aliased in pairs (AB with CD, etc.).
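A quick way to see this aliasing is to build the design matrix and compare contrast columns; aliased effects produce identical columns. A sketch:

```python
# Sketch: build the 2^(4-1) half-fraction (D = ABC) and confirm one alias
# pair numerically; aliased effects have identical contrast columns.
from itertools import product

rows = []
for a, b, c in product([-1, +1], repeat=3):
    d = a * b * c                      # generator D = ABC
    rows.append((a, b, c, d))

ab = [a * b for a, b, c, d in rows]    # contrast column for AB
cd = [c * d for a, b, c, d in rows]    # contrast column for CD
print(ab == cd)                        # True: AB and CD are aliased
```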

Interpreting Aliased Effects

The alias pairs in our 2^(4-1) Resolution IV design are:

Estimated effect | Actually measures | Practical impact
A                | A + BCD           | Clean (BCD is negligible)
B                | B + ACD           | Clean (ACD is negligible)
C                | C + ABD           | Clean (ABD is negligible)
D                | D + ABC           | Clean (ABC is negligible)
AB               | AB + CD           | Ambiguous — could be either
AC               | AC + BD           | Ambiguous — could be either
AD               | AD + BC           | Ambiguous — could be either

Main effects are clean because they're aliased only with three-factor interactions (which are almost always negligible). But if the "AB" effect looks large, you can't tell whether it's really AB, or really CD, or some combination of both. This is where subject-matter expertise comes in — does an interaction between A and B make physical sense?

🔊 The Cocktail Party Analogy

Aliasing is like two people talking at the same time at a cocktail party. In Resolution IV, the loud speakers (main effects) are in separate rooms — you hear them clearly. But the quieter speakers (two-factor interactions) are paired up in the same room. You can hear that someone is talking, but can't always tell who. A foldover experiment is like asking one of them to step outside so you can hear the other clearly.

Foldover: Resolving Aliases

If a fractional factorial reveals ambiguous results (you can't tell if the effect comes from AB or CD), a foldover resolves the ambiguity. A foldover reverses the signs of one or more columns and runs the resulting design as a supplementary experiment.

Full Foldover

Reversing all column signs and combining with the original design doubles the runs but upgrades the resolution by one level. A Resolution III design becomes Resolution IV. This is the safest option when you're unsure which aliases are causing problems.

Single-Factor Foldover

Reversing just one column (say, factor A) de-aliases all effects involving A. This is more efficient when you suspect a specific factor is involved in the ambiguous interaction.

Worked Example: Full Foldover of a 2^(4-1)

If our 2^(4-1) analysis showed a large "AB+CD" effect and we can't tell which pair is responsible, we run a full foldover (reverse all signs):

Original (D = ABC): estimates AB + CD. 8 runs, Resolution IV.

Foldover (D = -ABC): estimates AB - CD. 8 additional runs, Resolution IV.

Combining the two half-fractions:

AB = ½[(AB+CD)_original + (AB-CD)_foldover]
CD = ½[(AB+CD)_original - (AB-CD)_foldover]

The combined 16-run design is equivalent to a full 2^4 factorial. All aliases are resolved. The cost: you've doubled your runs from 8 to 16, but you only do this when the initial analysis reveals ambiguity worth resolving. In many cases, the initial 8-run fraction is sufficient.
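The combining step is just the sum-and-difference arithmetic above. A sketch with made-up effect sizes (AB = 6 and CD = 2 are hypothetical values, not from the text's experiment):

```python
# Sketch: the foldover algebra with hypothetical effect sizes.
true_AB, true_CD = 6.0, 2.0

est_original = true_AB + true_CD   # original fraction estimates AB + CD
est_foldover = true_AB - true_CD   # foldover (D = -ABC) estimates AB - CD

AB = 0.5 * (est_original + est_foldover)
CD = 0.5 * (est_original - est_foldover)
print(AB, CD)                      # 6.0 2.0: individual effects recovered
```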

Augmentation in the DOE Helper Tool

The doe augment command automates design augmentation. It supports fold-over, star points (to upgrade a factorial to CCD), and center points (to detect curvature and estimate pure error):

# Fold-over to de-alias effects
doe augment --config config.json --type fold_over

# Add star points for RSM
doe augment --config config.json --type star_points

# Add center points to detect curvature
doe augment --config config.json --type center_points

The augmented script only runs the new experiments — it skips already-completed runs automatically.

Choosing the Right Fraction

The DOE Helper Tool selects minimum-aberration generators automatically, but here's the decision logic:

Factors | Recommended design              | Runs   | Resolution
2–4     | Full factorial                  | 4–16   | full (no aliasing)
5       | 2^(5-1)                         | 16     | V
6       | 2^(6-1)                         | 32     | VI
7       | 2^(7-2)                         | 32     | IV
8–11    | Plackett-Burman or 2^(k-p)      | 12–32  | III–IV
12+     | Plackett-Burman screening first | varies | III

Fractional Factorial Workflow

A typical fractional factorial study follows this pattern:

  1. Start with the right resolution. For initial screening, Resolution III or IV. For characterization, Resolution IV+.
  2. Run the fraction. Analyze main effects and aliased interaction estimates.
  3. Decide next steps:
    • If main effects are clear and no interactions look significant → done, proceed to optimization.
    • If some aliased interaction pairs look significant → foldover to resolve them.
    • If many factors are insignificant → drop them and run a full factorial or RSM on the survivors.
Fractional factorial with the DOE Helper Tool
# Config: 5 factors, fractional factorial (2^(5-1) = 16 runs instead of 32)
$ doe info --config config.json
Operation: fractional_factorial
Factors: 5 (all 2-level)
Base runs: 16 (Resolution V)

# Generate and run
$ doe generate --config config.json --seed 42
$ bash run_experiments.sh

# Analyze: main effects, interactions, Pareto chart
$ doe analyze --config config.json
# Result: factors A and C dominate. Move to CCD with just those two.

The Power of Sequential Experimentation

Don't try to learn everything in one experiment. Run a fraction, learn what you can, then decide what to do next. This iterative approach — called sequential experimentation — is more efficient than any single design, no matter how clever. Each round of experiments informs the design of the next round.


Chapter 5

Screening Designs

When you have many factors and limited budget, Plackett-Burman designs let you screen them all with remarkably few runs.

Plackett-Burman Designs

Plackett-Burman designs are 2-level screening designs that handle up to N-1 factors in N runs, where N is a multiple of 4. They estimate main effects only — all interactions are confounded.

  • 8 runs for 7 factors
  • 12 runs for 11 factors
  • 20 runs for 19 factors

How Plackett-Burman Works

PB designs are constructed from special cyclic generator sequences. For N=12 runs, the first row of the design is a specific sequence of + and - signs. Each subsequent row is formed by shifting the previous row one position to the right (wrapping around). The last row is all minus signs.

The resulting design is orthogonal for main effects but has complex confounding patterns for interactions. Unlike fractional factorials, where aliases follow a clean algebraic structure, PB confounding is "partially confounded" — each two-factor interaction is spread across multiple main effect estimates.
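The cyclic construction takes only a few lines of code. This sketch builds the standard PB-12 matrix from its generator row and verifies that every pair of columns is orthogonal (zero dot product):

```python
# Sketch: build the standard PB-12 matrix by cyclic shifts, then check
# pairwise orthogonality of the 11 columns.
first_row = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

rows = [first_row]
for _ in range(10):                      # 10 more cyclic shifts
    prev = rows[-1]
    rows.append([prev[-1]] + prev[:-1])  # shift right, wrapping around
rows.append([-1] * 11)                   # final row: all minus

for i in range(11):
    for j in range(i + 1, 11):
        assert sum(rows[r][i] * rows[r][j] for r in range(12)) == 0
print("PB-12 columns are pairwise orthogonal")
```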

PB Confounding Is Messy

In a fractional factorial, you know exactly which effects are aliased (e.g., AB is aliased with CD). In a Plackett-Burman design, each two-factor interaction contributes partially to multiple main effect columns. This means a large interaction could inflate the apparent effect of an unrelated factor. This is acceptable for screening (where you only need to identify candidates), but not for characterization.

Example: 12-Run Plackett-Burman Matrix

A PB-12 design for up to 11 factors. Here we use 7 real factors (A–G) and 4 dummy columns (d1–d4):

Run | A | B | C | D | E | F | G | d1 | d2 | d3 | d4
1   | + | + | - | + | + | + | - | -  | -  | +  | -
2   | - | + | + | - | + | + | + | -  | -  | -  | +
3   | + | - | + | + | - | + | + | +  | -  | -  | -
4   | - | + | - | + | + | - | + | +  | +  | -  | -
5   | - | - | + | - | + | + | - | +  | +  | +  | -
6   | - | - | - | + | - | + | + | -  | +  | +  | +
7   | + | - | - | - | + | - | + | +  | -  | +  | +
8   | + | + | - | - | - | + | - | +  | +  | -  | +
9   | + | + | + | - | - | - | + | -  | +  | +  | -
10  | - | + | + | + | - | - | - | +  | -  | +  | +
11  | + | - | + | + | + | - | - | -  | +  | -  | +
12  | - | - | - | - | - | - | - | -  | -  | -  | -

The last four columns (d1–d4) are dummy factors. Their estimated effects measure only noise, providing a reference for judging which real factors are significant.

The Screening-to-Optimization Pipeline

The Two-Stage Strategy

Stage 1 (Screen): Use Plackett-Burman to test all candidates. Identify the 2–3 factors with the largest effects.
Stage 2 (Optimize): Run a full factorial, CCD, or Box-Behnken on just those important factors to find the true optimum.
Result: ~25 total runs instead of hundreds, with a rigorous path from "which factors matter?" to "what's the best setting?"

Limitations

Plackett-Burman cannot estimate interaction effects. If you suspect strong interactions, use a fractional factorial (Resolution IV+) instead. Also, PB requires exactly 2 levels per factor.

The Three-Stage Pipeline in Detail

Stage 1 (Screen): Plackett-Burman. 7+ factors → 8–12 runs. Find the vital few.
Stage 2 (Characterize): Factorial or RSM. 2–3 factors → 15–20 runs. Model the response surface.
Stage 3 (Confirm): Verification runs. 1 setting → 3–5 runs. Validate the optimum.

Total cost: ~25–35 runs to go from "I have 7+ factors and no idea which matter" to "I have a validated optimum setting." Compare this to a full 2^7 factorial alone (128 runs) or grid search (thousands of runs).

🎯 The Funnel Analogy

Screening is like casting a wide net. You start with many candidates and quickly eliminate the irrelevant ones. Characterization narrows the focus to the survivors. Confirmation is the final check. Each stage is cheaper than the last because you're studying fewer factors at higher resolution. This funnel-shaped approach is the most efficient way to optimize any process.

Worked Example: HPC Cluster Screening

An HPC team suspects that 7 parameters affect job throughput: MPI ranks per node, OpenMP threads, memory allocation policy, I/O scheduler, network buffer size, file striping count, and checkpoint interval. Testing all 128 combinations of a full 2^7 factorial is impractical — each run takes 20 minutes on a shared cluster.

A Plackett-Burman design requires only 8 runs (2 hours of wall-clock time). The Pareto chart from the analysis reveals that MPI ranks per node and network buffer size together account for 72% of the throughput variation. The other five factors are noise.

Full Factorial: 128 runs × 20 min ≈ 43 hours. All 127 effects are estimable, but only 2 actually matter.

Plackett-Burman + Follow-up: 8 screening runs + 4 factorial runs = 4 hours. Identifies the 2 important factors, with full interaction detail on the vital few.

Interpreting the Screening Results

The analysis output from the HPC screening might look like this:

doe analyze output
=== Main Effects: throughput_jobs_hr ===
Factor                 Effect   Std Error   % Contribution
--------------------------------------------------------------
mpi_ranks_per_node       45.2         3.1            42.8%
network_buffer_size      30.8         3.1            29.2%
io_scheduler              8.4         3.1             8.0%
file_striping_count       7.1         3.1             6.7%
openmp_threads            5.9         3.1             5.6%
memory_policy             4.2         3.1             4.0%
checkpoint_interval       3.9         3.1             3.7%

The first two factors (MPI ranks and network buffer) together account for 72% of the variation. Using the dummy-factor approach, suppose the average dummy effect is 3.0. MPI ranks (45.2) and network buffer (30.8) far exceed the 3× threshold (9.0), confirming them as definitely active. io_scheduler (8.4) and file_striping_count (7.1) fall in the borderline 2–3× band; the remaining factors are consistent with noise.

The Decision Point

After screening, you face a choice:
  • Drop the 5 inactive factors and run a full factorial or CCD on MPI ranks + network buffer (as few as 9–13 runs).
  • Keep io_scheduler (borderline at 8.0%) as a third factor in the follow-up (15–20 runs).
  • Fix all 5 inactive factors at their cheaper/simpler levels (e.g., default memory policy, shortest checkpoint interval).
The screening data gives you the information to make this decision rationally.

Dummy Factors and Error Estimation

If your Plackett-Burman design has more columns than real factors, the leftover columns become dummy factors — factors that don't correspond to any real variable. Their estimated effects should be zero (they're just noise), so they provide an estimate of experimental error.

Using Dummy Factors

In a 12-run PB design with 7 real factors, you have 4 dummy columns. The effects of these dummies estimate the noise floor. Any real factor whose effect exceeds 2–3 times the average dummy effect is likely significant. This is a practical alternative to formal hypothesis testing when you have no replicates.

Practical Rules for Dummy Factor Analysis

Criterion         | Rule                            | Rationale
Definitely active | |Effect| > 3 × avg dummy effect | Very unlikely to be noise
Possibly active   | 2–3 × avg dummy effect          | Worth investigating in follow-up
Inactive          | |Effect| < 2 × avg dummy effect | Consistent with noise

If you have no dummy columns (e.g., a PB-8 with exactly 7 real factors), use the normal probability plot instead. The tool generates Pareto charts that visually separate the large effects from the noise floor — the steep initial bars are the active factors, and the remaining short bars are noise.
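Applying these rules is mechanical. A sketch that classifies the effects from the HPC screening example, assuming the average dummy effect of 3.0 used in the text:

```python
# Sketch: classify screening effects against the dummy-factor thresholds.
# The 3.0 average dummy effect is the value assumed in the text.
effects = {
    "mpi_ranks_per_node": 45.2, "network_buffer_size": 30.8,
    "io_scheduler": 8.4, "file_striping_count": 7.1,
    "openmp_threads": 5.9, "memory_policy": 4.2,
    "checkpoint_interval": 3.9,
}
avg_dummy = 3.0

verdicts = {}
for name, effect in sorted(effects.items(), key=lambda kv: -kv[1]):
    if abs(effect) > 3 * avg_dummy:
        verdicts[name] = "definitely active"
    elif abs(effect) >= 2 * avg_dummy:
        verdicts[name] = "possibly active"
    else:
        verdicts[name] = "inactive"
    print(f"{name:22s} {effect:5.1f}  {verdicts[name]}")
```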

Definitive Screening Designs

Definitive Screening Designs (DSDs), introduced by Jones and Nachtsheim (2011), are a modern alternative to Plackett-Burman for continuous factors. They require only 2k+1 runs (for k factors), but can estimate:

  • All main effects
  • All quadratic effects
  • Some two-factor interactions (when factors are dropped from the model)

DSDs use three levels per factor (-1, 0, +1), so they can detect curvature that PB designs miss. The trade-off is slightly more runs and the requirement that factors be continuous (not categorical).

When to Choose DSDs Over PB

Use DSDs when all factors are continuous and you suspect curvature may be important. Use Plackett-Burman when you have categorical factors, when minimizing runs is critical, or when you're confident that a linear model is sufficient for screening purposes.

Comparing Screening Designs

Design                         | Runs (for 7 factors) | Main effects          | Quadratic | Interactions     | Factor types
Plackett-Burman                | 8                    | Yes                   | No        | Confounded       | Any (2-level)
Fractional factorial (Res III) | 8                    | Yes (aliased w/ 2FIs) | No        | Aliased          | Any (2-level)
Fractional factorial (Res IV)  | 16                   | Yes (clear)           | No        | Aliased in pairs | Any (2-level)
Definitive Screening           | 15                   | Yes                   | Yes       | Some             | Continuous only

Quick Decision Guide

Many factors, all 2-level, budget-constrained: Plackett-Burman
Need clean main effects with known alias structure: Fractional factorial
All continuous factors, suspect curvature: Definitive Screening Design
Fewer than 5 factors, each run is cheap: Skip screening, go straight to full factorial or RSM

Screening Workflow with DOE Helper

Complete screening workflow
# Config: 7 factors, Plackett-Burman (8 runs)
$ doe info --config screening_config.json
Operation: plackett_burman
Factors: 7 (2-level each)
Base runs: 8

# Generate, run, analyze
$ doe generate --config screening_config.json --seed 42
$ bash run_experiments.sh
$ doe analyze --config screening_config.json

# Result: factors A and C are active. Create a new config for optimization.
$ doe info --config optimization_config.json
Operation: central_composite
Factors: 2 (continuous)
Base runs: 13

# Run the optimization study
$ doe generate --config optimization_config.json --seed 42
$ bash run_experiments.sh
$ doe analyze --config optimization_config.json
$ doe optimize --config optimization_config.json

Chapter 6

Response Surface Designs

When screening has identified the key factors, response surface designs add star points and center replicates to fit quadratic models and find the true optimum.

Central Composite Design (CCD)

A CCD consists of three types of experimental points, each serving a distinct purpose:

  1. Factorial points (2^k): the corners of the design cube. These estimate linear effects and interactions — the same information a full factorial provides.
  2. Star (axial) points (2k): points along each axis at distance α from center. These provide the additional information needed to estimate quadratic (curvature) terms.
  3. Center replicates (c): repeated runs at the center, providing a direct estimate of experimental error and enabling the lack-of-fit test.

Here's what a CCD looks like for 2 factors:

[Figure: Central composite design for 2 factors. Factorial points (4) at (±1, ±1); star points (4) at (0, ±α) and (±α, 0); center replicates (5) at the origin.]

The factorial points form a square at the corners. The star points extend outward along each axis. The center points are stacked at the middle. Together, these 13 runs provide enough information to fit all 6 terms of a two-factor quadratic model (intercept, two linear, one interaction, two quadratic).

Factors | Factorial | Star | Center | Total | Model terms fitted
2       | 4         | 4    | 5      | 13    | 6
3       | 8         | 6    | 6      | 20    | 10
4       | 16        | 8    | 7      | 31    | 15
5       | 32        | 10   | 8      | 50    | 21
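The run counts and model sizes in this table follow two simple formulas: runs = 2^k + 2k + c, and a full quadratic model has 1 + k + k(k-1)/2 + k terms. A sketch reproducing the table:

```python
# Sketch: CCD run counts versus quadratic-model size, reproducing the
# table above (c = number of center replicates used there).
def ccd_runs(k, c):
    return 2 ** k + 2 * k + c          # factorial + star + center points

def quadratic_terms(k):
    # intercept + k linear + k(k-1)/2 two-factor interactions + k quadratic
    return 1 + k + k * (k - 1) // 2 + k

for k, c in [(2, 5), (3, 6), (4, 7), (5, 8)]:
    print(f"k={k}: {ccd_runs(k, c)} runs, {quadratic_terms(k)} model terms")
```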

Why CCD Works: Building On What You Have

A major practical advantage of CCD is that you can build it sequentially. If you've already run a 2^k factorial with center points, you don't need to start over — just add the star points. This saves the factorial runs and lets you upgrade from a linear model to a quadratic model incrementally.

The Alpha Parameter in CCD

The star-point distance α controls where the axial points sit relative to the factorial cube. This single parameter determines the shape and statistical properties of the entire design:

Choice        | α value (2 factors) | Property                                                                | When to use
Face-centered | 1.00                | Star points on cube faces; only 3 levels needed per factor              | Factors with only 3 feasible levels; simplest to implement
Rotatable     | 1.414 (= 2^(k/4))   | Prediction variance depends only on distance from center, not direction | General-purpose; best when you don't know which direction matters most
Spherical     | 1.414 (= √k)        | All points lie on a sphere of equal radius                              | When the factor space has natural circular/spherical boundaries

Face-centered (α = 1): star points sit on the cube faces. Only 3 levels per factor. Simplest.
Rotatable (α = 1.414): star points extend beyond the cube. 5 levels per factor. Equal prediction precision in all directions.

Practical Consideration: Factor Ranges

With a rotatable CCD (α > 1), the star points lie outside the factorial range. For 2 factors, α = 1.414 means the star points are 41% beyond the original range. Make sure these extreme values are physically feasible and safe before generating the design. If they aren't, use face-centered (α = 1).
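Checking feasibility is a one-line calculation per factor: convert the coded star points back to natural units. A sketch using a temperature factor spanning 150–200 °C (center 175, half-range 25), the same range as the worked example later in this chapter:

```python
# Sketch: rotatable alpha and star-point locations in natural units
# for a 150-200 degree C temperature factor.
k = 2
alpha = (2 ** k) ** 0.25           # rotatable: alpha = (2^k)^(1/4)
center, half_range = 175.0, 25.0

low_star = center - alpha * half_range
high_star = center + alpha * half_range
print(round(alpha, 3))                           # 1.414
print(round(low_star, 1), round(high_star, 1))   # 139.6 210.4
```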

Box-Behnken Design

Box-Behnken is an alternative RSM design that never tests the corner conditions. Every run has at most two factors at their extreme levels — the remaining factors stay at center. This makes Box-Behnken ideal when the corners represent dangerous, expensive, or physically impossible combinations.

[Figure: Box-Behnken design for 3 factors (15 runs). The 12 design points sit at the cube's edge midpoints; the corners are not tested.]

In a 3-factor Box-Behnken, the 12 design points sit at the midpoints of the cube's edges, plus 3 center replicates. No run has all three factors at their extremes simultaneously. Compare this to a CCD, which tests every corner:

Property                   | CCD (3 factors)                    | Box-Behnken (3 factors)
Total runs                 | 20 (8+6+6)                         | 15 (12+3)
Corner conditions tested?  | Yes (all 8 corners)                | No
Levels per factor          | 5 (rotatable) or 3 (face-centered) | 3
Can be built sequentially? | Yes (add stars to factorial)       | No (requires all points)
Quadratic model fittable?  | Yes                                | Yes

CCD vs. Box-Behnken: Decision Guide

Use CCD when: Corner conditions are safe and feasible, you want to build on an existing factorial design, or you want the flexibility of choosing α.

Use Box-Behnken when: Corner conditions are dangerous or impossible (e.g., max temperature + max pressure + max flow rate in a reactor), when you want fewer total runs for 3+ factors, or when only 3 levels per factor are practical.

When Corner Conditions Are Dangerous

In a chemical reactor, running at maximum temperature, maximum pressure, and maximum catalyst simultaneously could cause a runaway reaction. In a server cluster, setting all resource parameters to maximum simultaneously could cause cascading failures. Box-Behnken avoids these extremes by design — no run ever has more than two factors at their limits.

Latin Hypercube Sampling

Latin Hypercube Sampling (LHS) is fundamentally different from factorial and RSM designs. Instead of testing specific combinations at discrete levels, LHS generates a space-filling set of points across the full continuous factor range.

For N samples and k factors, LHS divides each factor's range into N equal intervals and places exactly one sample in each interval for every factor:

[Figure: two sampling patterns compared.]
Random sampling (clumps): points cluster, leaving large unexplored gaps.
Latin hypercube (space-filling): exactly one point per row and column; uniform coverage guaranteed.

♟ Think of It Like Sudoku

In a Latin Hypercube, each row and column of the sampling grid has exactly one point — like placing non-attacking rooks on a chessboard. This guarantees that no region of the factor space is left unexplored, unlike pure random sampling which can leave large gaps.
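A minimal LHS sampler is only a few lines: shuffle the interval indices for each factor independently, then jitter each point within its interval. This is an illustrative sketch; a real study would typically use a library implementation such as pyDOE3 or scipy.stats.qmc.

```python
# Sketch: a minimal Latin Hypercube sampler. Each factor's [0, 1) range is
# split into n equal intervals; exactly one point lands in each interval.
import random

def latin_hypercube(n_samples, n_factors, rng=random.Random(42)):
    columns = []
    for _ in range(n_factors):
        cells = list(range(n_samples))
        rng.shuffle(cells)                 # random interval order per factor
        columns.append([(c + rng.random()) / n_samples for c in cells])
    return list(zip(*columns))             # rows are sample points

points = latin_hypercube(5, 2)
for dim in range(2):
    # Exactly one point in each fifth of the range, in every dimension:
    print(sorted(int(p[dim] * 5) for p in points))   # [0, 1, 2, 3, 4]
```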

When to Use LHS vs. Factorial Designs

Situation         | Use LHS                                           | Use Factorial/CCD
Goal              | Explore a large space; build surrogate models     | Estimate specific effects; find precise optimum
Factors           | Many continuous factors (5+)                      | Few factors (2–5)
Levels            | Continuous range; many unique values              | Discrete: 2–5 levels per factor
Model type        | Gaussian process, neural net, or any ML model     | Linear/quadratic polynomial
Statistical tests | None built-in; need bootstrap or cross-validation | ANOVA, t-tests, confidence intervals

LHS Cannot Estimate Effects Directly

Unlike factorial designs, LHS does not produce orthogonal columns. You cannot compute clean main effects or interaction effects from LHS data. LHS is best paired with flexible modeling techniques (Gaussian processes, random forests) that don't require orthogonality. If you need traditional DOE analysis (Pareto charts, effect estimates, ANOVA), use factorial or RSM designs.

Quick Reference: Choosing a Design

This is the most important table in the book. Use it to select your design based on what you need to learn:

Question you need to answer                     | Design               | Typical runs
Which of these 10+ factors matter?              | Plackett-Burman      | 8–20
Screen factors + detect curvature?              | Definitive Screening | 2k+1
Main effects + some interactions? (5–8 factors) | Fractional Factorial | 16–32
All effects, no compromises? (2–4 factors)      | Full Factorial       | 4–16
Robust design with signal-to-noise ratios?      | Taguchi              | 4–27
Find the quadratic optimum? Corners safe?       | CCD                  | 13–50
Find the quadratic optimum? Corners dangerous?  | Box-Behnken          | 13–46
Custom run count, maximum information?          | D-Optimal            | Any
Uniform coverage of a large continuous space?   | Latin Hypercube      | 10–100+
Formulation where components sum to 1?          | Mixture (Simplex)    | Depends on components

The Design Selection Flowchart

Step 1: Is this a mixture/formulation problem? If yes, use Mixture Simplex designs.
Step 2: How many factors? If > 6, start with screening (PB or Definitive Screening). DSD is preferred when you suspect curvature — it detects it in the same experiment.
Step 3: Do you need interactions? If no, PB is sufficient. If yes, use fractional or full factorial.
Step 4: Do you suspect curvature? Check center points. If center-point mean differs from factorial mean, upgrade to CCD or Box-Behnken.
Step 5: Are corner conditions feasible? If yes, CCD. If no, Box-Behnken.
Step 6: Is the factor space continuous with many dimensions? Consider LHS for initial exploration.
Step 7: Need a specific number of runs? Use D-Optimal to maximize information per run.
Step 8: Need robust design against noise factors? Use Taguchi orthogonal arrays.
Step 9: Need to extend an existing design? Use doe augment with fold-over, star points, or center points.

Worked Example: Chemical Process Optimization

A chemical engineer has identified two critical factors from a screening study: reaction temperature (150–200°C) and catalyst concentration (1–5%). They choose a rotatable CCD (α = 1.414) with 5 center replicates, giving 13 runs total.

Run  | Type      | x1 (coded) | x2 (coded) | Temp (°C) | Catalyst (%) | Yield (%)
1    | Factorial | -1         | -1         | 150       | 1            | 62
2    | Factorial | +1         | -1         | 200       | 1            | 74
3    | Factorial | -1         | +1         | 150       | 5            | 78
4    | Factorial | +1         | +1         | 200       | 5            | 85
5    | Star      | -1.414     | 0          | 139.6     | 3            | 66
6    | Star      | +1.414     | 0          | 210.4     | 3            | 79
7    | Star      | 0          | -1.414     | 175       | 0.17         | 58
8    | Star      | 0          | +1.414     | 175       | 5.83         | 82
9–13 | Center    | 0          | 0          | 175       | 3            | 88, 87, 89, 86, 88

Step 1: Check for Curvature

The average of the 4 factorial points is (62+74+78+85)/4 = 74.75%. The average of the 5 center points is (88+87+89+86+88)/5 = 87.6%. The center-point average is 12.85 percentage points higher than the factorial average — strong evidence of quadratic curvature. The optimum lies inside the design space, not at the corners.
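The curvature check is simple arithmetic; a sketch reproducing the numbers above:

```python
# Sketch: the curvature check from Step 1, using the yields in the table.
factorial_yields = [62, 74, 78, 85]
center_yields = [88, 87, 89, 86, 88]

factorial_mean = sum(factorial_yields) / len(factorial_yields)
center_mean = sum(center_yields) / len(center_yields)

print(factorial_mean)                          # 74.75
print(center_mean)                             # 87.6
print(round(center_mean - factorial_mean, 2))  # 12.85 -> strong curvature
```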

The Curvature Test Is Your First Check

If center-point mean ≈ factorial mean, the surface is approximately flat (linear) in this region. Use steepest ascent to move toward the optimum. If center-point mean is significantly higher (or lower), curvature is present and the CCD data can map it. In our example, the 12.85% gap confirms strong curvature.

Step 2: Interpret the Fitted Model

The full quadratic model fitted to all 13 runs yields the prediction equation (see Chapter 8 for complete RSM analysis). The fitted model finds the predicted optimum at approximately 180°C and 3.5% catalyst, with a predicted yield of 89.2%.

Step 3: What the Star Points Reveal

The star points at extreme temperature (runs 5–6) give yields of 66% and 79% — both well below the center points. Similarly, extreme catalyst (runs 7–8) give 58% and 82%. This confirms that both factors have strong downward curvature at the extremes — there really is a peak in the middle.

What If Center Points Weren't Higher?

If the center yields had been ~75% (close to the factorial average), the linear model would be adequate and the CCD star points wouldn't add much value. You'd use the factorial data alone and apply steepest ascent to move toward the optimum. Always check center points first before committing to the full quadratic analysis.

Sequential Experimentation

RSM designs are most powerful when used sequentially, in stages that build on each other. This avoids the common mistake of running an expensive, comprehensive experiment in a region far from the optimum.

The Three-Phase RSM Strategy

  1. Phase 1 — Screen: Run a Plackett-Burman or fractional factorial to identify the 2–4 important factors from the original 8–15 candidates. (Chapter 5)
  2. Phase 2 — First-order model + steepest ascent: Run a 2^k factorial with center points on the important factors. If the surface is approximately linear (center ≈ factorial average), use steepest ascent to move toward the optimum. Stop when the response levels off or starts declining.
  3. Phase 3 — Second-order model: Center a CCD or Box-Behnken around the best point from Phase 2. Fit the full quadratic model. Locate the precise optimum. Run confirmation experiments.
[Figure: Sequential RSM strategy. Phase 1, Screen (PB or fractional factorial): find the vital few factors. Phase 2, Move (factorial + steepest ascent): approach the optimum region. Phase 3, Optimize (CCD or Box-Behnken): map the peak precisely.]

⛰ The Hill-Climbing Analogy

Imagine finding the peak of a mountain in fog. Phase 1 is looking around to see which directions slope upward (screening). Phase 2 is walking uphill in the steepest direction (steepest ascent). Phase 3 is stopping when the ground levels off and mapping the summit precisely (CCD). This iterative approach is far more efficient than trying to map the entire mountain at once.

How Many Total Runs?

A typical three-phase RSM study on 10 candidate factors:
Phase 1: 12-run Plackett-Burman identifies 3 important factors.
Phase 2: 8-run factorial + 3 center points + 5 steepest ascent runs = 16 runs.
Phase 3: 20-run CCD + 3 confirmation runs = 23 runs.
Total: ~51 runs from "which factors matter?" to "here's the validated optimum." A full factorial on 10 factors at 2 levels would need 1,024 runs — and still wouldn't find the quadratic optimum.


Chapter 7

Analyzing Results

Once you've run your experiments, the analysis phase turns raw data into actionable knowledge. The DOE Helper Tool automates this entire process.

The Pareto Chart

A Pareto chart ranks all effects by magnitude and shows their cumulative contribution. It is the single most important chart in DOE analysis — one glance tells you which factors matter and which don't.

Consider a 2^3 factorial on web server tuning with factors Thread Count (A), Cache Size (B), and Connection Timeout (C). The analysis produces these effect magnitudes:

Effect         | Magnitude | % Contribution | Cumulative %
B (Cache Size) | 18.4      | 42.1%          | 42.1%
A (Threads)    | 12.6      | 28.8%          | 70.9%
AB             | 7.3       | 16.7%          | 87.6%
C (Timeout)    | 2.8       | 6.4%           | 94.0%
AC             | 1.4       | 3.2%           | 97.2%
BC             | 0.8       | 1.8%           | 99.0%
ABC            | 0.4       | 1.0%           | 100%
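The contribution columns are computed directly from the effect magnitudes: each effect's share of the total magnitude, accumulated in descending order. A sketch:

```python
# Sketch: the percentage and cumulative columns of the Pareto table,
# computed from the effect magnitudes.
effects = {"B": 18.4, "A": 12.6, "AB": 7.3, "C": 2.8,
           "AC": 1.4, "BC": 0.8, "ABC": 0.4}

total = sum(effects.values())                  # 43.7
cumulative = 0.0
for name, magnitude in sorted(effects.items(), key=lambda kv: -kv[1]):
    pct = 100 * magnitude / total
    cumulative += pct
    print(f"{name:4s} {pct:5.1f}%  cumulative {cumulative:5.1f}%")
```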

The Pareto chart renders this as descending bars with a cumulative line:

[Figure: Pareto chart of effects for server throughput. Descending bars for B, A, AB, C, AC, BC, ABC, with a cumulative-percentage line that crosses the 80% threshold at the AB bar (87.6%).]

How to Interpret This Pareto Chart

The bars are sorted tallest to shortest. Cache Size (B) dominates, followed by Threads (A). Together with their interaction (AB), these three effects account for 88% of all variation — past the 80% threshold.

The cumulative line rises steeply at first, then flattens. Once it crosses 80%, the remaining effects are noise. Factors C, AC, and BC contribute less than 12% combined — you can fix Connection Timeout at any convenient value.

Action: The AB interaction is large (16.7%), which means Threads and Cache Size don't act independently. You must optimize them together, not separately. This is exactly the insight OVAT would miss.

Main Effects Plots

A main effects plot shows the average response at each level of each factor. It answers the question: "When I change this factor from low to high, what happens to the response on average?"

Using the same web server example, suppose the mean throughput at each factor level is:

Factor                   | Low level mean | High level mean | Effect
A (Threads): 10 vs 100   | 660 req/s      | 920 req/s       | +260
B (Cache): 64MB vs 512MB | 580 req/s      | 1000 req/s      | +420
C (Timeout): 5s vs 30s   | 770 req/s      | 810 req/s       | +40
[Figure: Main effects plot for server throughput (req/s). A (Threads): 660 at low, 920 at high. B (Cache Size): 580 at low, 1000 at high. C (Timeout): 770 at low, 810 at high.]

How to Interpret Main Effects Plots

Steep line = large effect. Factor B (Cache Size) has the steepest slope — it produces a 420 req/s improvement when going from low to high. This is the most impactful knob to turn.

Moderate line = moderate effect. Factor A (Threads) also matters, with a 260 req/s gain. Worth optimizing, especially given its interaction with B.

Flat line = negligible effect. Factor C (Timeout) barely changes the response (40 req/s). Set it to whatever is operationally convenient — it won't affect throughput meaningfully.

Direction matters. All three lines slope upward, meaning higher levels improve throughput. If a line sloped downward, the low level would be better for that factor.

Main Effects Plots Can Be Misleading

A main effects plot shows averages across all levels of all other factors. If a strong interaction exists (like AB here), the average can mask what's really happening. Factor A might have a huge effect when B is high but almost no effect when B is low. Always check the interaction plot before drawing conclusions from main effects alone.

Interaction Plots

Interaction plots are the most important diagnostic in DOE. They show whether the effect of one factor depends on the level of another factor. This is information that OVAT experiments can never provide.

Continuing our server example, here are the mean throughput values for all four AB combinations:

                       B Low (64MB)   B High (512MB)
A Low (10 threads)     500            820
A High (100 threads)   660            1180
Interaction Plot: A (Threads) × B (Cache Size)
[Figure: throughput (req/s) on a 400–1200 axis. As B goes from Low (64 MB) to High (512 MB), the A Low (10 threads) line rises from 500 to 820 and the A High (100 threads) line rises from 660 to 1180.]

How to Interpret Interaction Plots

Non-parallel lines = interaction. The two lines in this plot diverge — the A High line rises much more steeply than the A Low line. This means the benefit of more cache is amplified when you also have more threads. This is a synergistic interaction.

Read the gap between lines. At B Low, the gap between A Low (500) and A High (660) is 160 req/s. At B High, the gap widens to 360 req/s (820 vs 1180). The effect of threads more than doubles when cache is large.

The practical implication: If you could only change one factor, you'd pick Cache Size (B). But the interaction tells you that changing both together yields 1180 — far more than the sum of their individual effects (160 + 320 = 480 above baseline of 500, giving 980). The extra 200 req/s is pure interaction benefit.
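All of these numbers can be reproduced directly from the four cell means. A minimal sketch in Python, using the values from the AB table above (the helper function and its name are illustrative, not part of the DOE Helper Tool):

```python
# Cell means from the AB table: cell[(a, b)] with levels coded -1 (low), +1 (high)
cell = {(-1, -1): 500, (-1, +1): 820, (+1, -1): 660, (+1, +1): 1180}

def effect(contrast):
    """Average response where the contrast is +1, minus where it is -1."""
    plus = [y for levels, y in cell.items() if contrast(*levels) > 0]
    minus = [y for levels, y in cell.items() if contrast(*levels) < 0]
    return sum(plus) / len(plus) - sum(minus) / len(minus)

main_a = effect(lambda a, b: a)        # 920 - 660 = 260 (Threads)
main_b = effect(lambda a, b: b)        # 1000 - 580 = 420 (Cache)
inter_ab = effect(lambda a, b: a * b)  # 100: half the 200 req/s interaction bonus
```

Note that the interaction effect (100) is half the 200 req/s deviation from additivity at the high-high corner; that factor of two is just the standard effect-coding convention.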

For comparison, here's what different interaction patterns look like and what they mean:

Parallel Lines = No Interaction

Both lines have the same slope. The effect of A is the same regardless of B's level. Each factor acts independently. You can optimize them separately.

Diverging Lines = Synergistic

Lines spread apart. The combined effect is greater than the sum of individual effects. Both factors should be set high (or both low) together. This is the pattern in our example.

Converging Lines = Antagonistic

Lines come together. One factor cancels part of the other's effect. The combined improvement is less than expected. You may need to choose which factor to prioritize.

Crossed Lines = Strong Interaction

Lines cross over. The direction of one factor's effect reverses depending on the other. The main effect average is meaningless here — only the interaction tells the real story. This is the most dangerous pattern to miss.

Statistical Significance

The tool computes 95% confidence intervals for each effect. A confidence interval tells you the range within which the true effect likely falls, accounting for experimental noise. If the interval does not include zero, the effect is statistically significant — meaning it's unlikely to be just noise.

Here's how the confidence intervals look for our server tuning example (with 2 replicates, SE = 2.1):

Effect        Estimate   95% CI          Significant?
B (Cache)     18.4       [14.0, 22.8]    Yes
A (Threads)   12.6       [8.2, 17.0]     Yes
AB            7.3        [2.9, 11.7]     Yes
C (Timeout)   2.8        [-1.6, 7.2]     No (includes 0)
AC            1.4        [-3.0, 5.8]     No
BC            0.8        [-3.6, 5.2]     No

The confidence interval for C (Timeout) spans from -1.6 to 7.2 — it includes zero, so we cannot confidently say Timeout has any real effect. The positive estimate of 2.8 could easily be noise. By contrast, Cache Size's interval [14.0, 22.8] is entirely above zero and far from it, so we're confident it's a real and large effect.
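The significance calls in the table follow mechanically from the intervals. A small sketch, assuming a t multiplier of about 2.1 for the error degrees of freedom here (the exact multiplier depends on your design; look it up for your df):

```python
SE = 2.1   # standard error of each effect, estimated from the replicates
T = 2.1    # assumed t multiplier for a 95% CI (depends on error df)

effects = {"B": 18.4, "A": 12.6, "AB": 7.3, "C": 2.8, "AC": 1.4, "BC": 0.8}

def ci(estimate, se=SE, t=T):
    """95% confidence interval: estimate plus/minus t times the standard error."""
    return (estimate - t * se, estimate + t * se)

# An effect is significant when its interval excludes zero
significant = {name: not (ci(e)[0] <= 0 <= ci(e)[1]) for name, e in effects.items()}
# B, A, and AB exclude zero; C, AC, and BC do not
```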

Both Statistically & Practically Significant

Effect: 12.5 ± 3.2

CI: [9.3, 15.7]

Large effect, CI excludes zero. Act on this.

Statistically but Not Practically Significant

Effect: 0.3 ± 0.1

CI: [0.2, 0.4]

CI excludes zero, but the effect is tiny. Probably ignore.

Practical vs. Statistical Significance

Statistical significance depends on sample size. With enough data, even trivially small effects become "statistically significant." Always ask: "Is this effect large enough to matter in my application?" A 0.1% throughput improvement is statistically real but not worth engineering effort. A 20% improvement is worth chasing even if the p-value is marginal. Define your minimum meaningful effect size before the experiment.

Residual Analysis

After fitting a model, always check the residuals (observed minus predicted values). The model assumes residuals are random, independent, and normally distributed with constant variance. If these assumptions fail, your effect estimates and confidence intervals may be unreliable.
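Computing residuals and the coordinates for a normal probability plot takes nothing beyond the standard library. A sketch with hypothetical observed and fitted values (the data here is invented for illustration):

```python
from statistics import NormalDist

observed  = [52.1, 48.7, 55.3, 50.2, 49.8, 53.6, 51.0, 47.9]  # hypothetical runs
predicted = [51.5, 49.2, 54.8, 50.9, 49.1, 53.0, 51.8, 48.4]  # model fits

residuals = [o - p for o, p in zip(observed, predicted)]

# Normal probability plot coordinates: sorted residuals vs. expected
# normal quantiles (Blom plotting positions). For a well-behaved model
# these pairs fall near a straight line.
n = len(residuals)
positions = [(i + 1 - 0.375) / (n + 0.25) for i in range(n)]
quantiles = [NormalDist().inv_cdf(p) for p in positions]
pairs = list(zip(quantiles, sorted(residuals)))
```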

There are four key residual plots to check:

1. Residuals vs. Predicted Values

Plot residuals on the Y-axis against predicted (fitted) values on the X-axis. You want a random scatter with no pattern:

Good: Random Scatter (residuals vs. predicted). Points scattered randomly around zero. Model assumptions hold.
Bad: Funnel Shape. Variance increases with predicted value. Try a log transform.
Bad: Curved Pattern. A U-shape means missing quadratic terms. Add squared terms or use CCD.
Bad: Time Trend (residuals vs. run order). Residuals drift over time. Use blocking or randomize run order.

2. Normal Probability Plot of Residuals

Plot residuals against expected normal quantiles. If the model is adequate and the data is well-behaved, points fall along a straight diagonal line. Deviations indicate non-normality:

Good: Points on the Line. Residuals are approximately normal. Analysis is valid.
Bad: S-Curve (Heavy Tails). An S-shape means a heavy-tailed distribution. Check for outliers.

Residual Analysis Summary

What you want: Random scatter in residuals vs. predicted, straight line in normal plot, no trends over time, no outliers.

What to do if it fails:
  • Funnel shape: Transform the response (log Y, √Y, or 1/Y). Log is the most common choice.
  • Curved pattern: Your linear model is missing curvature. Add center points to detect it, or upgrade to CCD/Box-Behnken to fit quadratic terms.
  • Time trend: Something drifted during the experiment (temperature, tool wear, cache warming). Block by time or re-randomize and repeat.
  • Outlier: Investigate the specific run. Was there a measurement error? Equipment issue? If you can identify the cause, remove it and note why. If not, keep it — the model should be robust to one outlier.

Multi-Response Analysis

Real experiments often have multiple responses. A database tuning study might measure throughput, latency, and CPU usage. Different responses may have conflicting optima — the settings that maximize throughput might also maximize CPU usage.

Consider a server optimization with two responses:

Setting            Throughput (req/s)   CPU Usage (%)
A Low, B Low       500                  35
A High, B Low      660                  52
A Low, B High      820                  48
A High, B High     1180                 88

The best throughput (1180) comes with 88% CPU usage — dangerously close to saturation. You need to find the sweet spot.

Strategies for Multiple Responses

1. Overlay plots: Plot contour maps of each response on the same axes. The feasible region is where all responses are within acceptable bounds. Visually identify the best operating point in the overlap zone.

2. Desirability functions: Convert each response to a 0–1 desirability score (0 = unacceptable, 1 = ideal) and maximize the geometric mean. This balances competing objectives automatically. The optimize command supports this approach.

3. Prioritize and constrain: Optimize the primary response (throughput) while constraining secondary responses to acceptable ranges (CPU < 70%). This is often the most practical approach — it frames the problem as "What's the best I can do without breaking anything?"
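Strategy 3 applies directly to the results table. A minimal sketch that picks the best throughput subject to a CPU ceiling, using the four settings above (the 70% limit is the example constraint, not a tool default):

```python
# (setting label, throughput in req/s, CPU usage in %)
runs = [
    ("A Low, B Low",   500, 35),
    ("A High, B Low",  660, 52),
    ("A Low, B High",  820, 48),
    ("A High, B High", 1180, 88),
]

CPU_LIMIT = 70  # constraint: stay below 70% CPU

# Keep only runs that satisfy the constraint, then maximize the primary response
feasible = [(name, tput, cpu) for name, tput, cpu in runs if cpu < CPU_LIMIT]
best = max(feasible, key=lambda run: run[1])
# best is ("A Low, B High", 820, 48): the top throughput within the CPU budget
```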

Effect Size and Power

Before running an experiment, consider: how large an effect do you need to detect? The power of a design is the probability of detecting a real effect of a given size. Higher power requires more replicates.

Power depends on four things:

Factor                     Effect on Power                     You Control It?
Effect size (signal)       Larger effect = easier to detect    Indirectly (widen factor ranges)
Noise (std deviation)      More noise = harder to detect       Partially (better measurement, blocking)
Sample size (replicates)   More runs = more power              Yes (set block_count)
Significance level (α)     Stricter threshold = less power     Yes (usually 0.05)

As a rough guide: a 2ᵏ design with 2 replicates can detect effects about 2–3 times the noise standard deviation with 80% power. If your effects are likely smaller than that, you need more replicates or a larger design.
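You can sanity-check this guide with a normal approximation. In a 2ᵏ design with N total runs, each effect is a difference of two means of N/2 runs each, so its standard error is 2σ/√N. A back-of-envelope power sketch (this ignores the t-distribution correction, so it is slightly optimistic for small designs; the tool's non-central F computation is more exact):

```python
from math import sqrt
from statistics import NormalDist

def approx_power(delta, sigma, n_runs, alpha=0.05):
    """Normal-approximation power to detect an effect of size delta."""
    se = 2 * sigma / sqrt(n_runs)               # SE of an effect in a 2^k design
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(delta / se - z_crit)

# 2^3 design with 2 replicates = 16 runs:
strong = approx_power(delta=2.0, sigma=1.0, n_runs=16)  # effect = 2x noise sd
weak = approx_power(delta=1.0, sigma=1.0, n_runs=16)    # effect = 1x noise sd
# strong is high (~0.98); weak is roughly a coin flip, so add replicates
```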

The DOE Helper Tool computes power for each factor using the non-central F distribution:

# Check power before running the experiment
doe power --config config.json --sigma 2.0 --delta 5.0

# Or estimate sigma from existing results
doe power --config config.json --delta 5.0 --results-dir results/

If power is below 0.80 for important factors, increase block_count in your config to add replicates, or widen your factor ranges to increase the expected effect size.

🔍 The Signal-to-Noise Analogy

Detecting an effect in a noisy experiment is like hearing a conversation in a loud room. A large effect (loud voice) is easy to detect even with few replicates. A small effect (whisper) requires either a quiet room (low noise) or many listeners (many replicates) to be heard reliably.

Confirmation Runs

After identifying the best settings, always run confirmation experiments at those settings. This is the final validation step before deploying to production. Compare the confirmation results to the model's prediction:

Metric       Model Prediction   Confirmation Result   Verdict
Throughput   1150 ± 85 req/s    1120, 1165, 1138      Within prediction interval
Latency      12.3 ± 2.1 ms      11.8, 13.1, 12.4      Within prediction interval

If the confirmation results fall within the model's prediction interval, the model is validated and you can proceed with confidence. If they don't:

  • Results consistently higher or lower: The model has a bias. Check for a lurking variable that changed between the experiment and confirmation.
  • Results much more variable: The experimental conditions weren't fully reproduced. Document what differed.
  • Results completely off: The model is inadequate. Consider additional terms, a wider design, or re-examining your assumptions.
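The comparison itself is a simple containment check. A sketch using the throughput numbers from the table, treating the prediction interval as prediction ± half-width:

```python
prediction, half_width = 1150, 85       # model: 1150 +/- 85 req/s
confirmation_runs = [1120, 1165, 1138]  # three confirmation measurements

lo, hi = prediction - half_width, prediction + half_width
validated = all(lo <= y <= hi for y in confirmation_runs)
# validated is True: every run falls inside [1065, 1235]
```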

Don't Skip Confirmation

A model is a simplification of reality. No matter how good the R², always confirm with real data before committing to production settings. Three confirmation runs is a reasonable minimum. If the stakes are high (production database, safety-critical system), run five or more.

Managing Real-World Experiments

Not all experiments are automated scripts. Physical experiments — lab tests, field measurements, manufacturing trials — require a different workflow. The DOE tool provides three commands specifically for this:

The Manual Experiment Toolkit

export-worksheet generates a printable run sheet (CSV or Markdown) with all factor settings and blank columns for recording responses. status shows a progress bar and tells you which run to do next. record lets you enter results interactively with validation. Together, they replace the test script in the automated workflow.

A typical real-world experiment proceeds in stages, often over days or weeks:

  1. Before the first run: Generate the design, then export a worksheet. Print it or open it in a spreadsheet.
  2. During the experiment: After each run, enter results with record. Use status to see what's left.
  3. Mid-experiment analysis: Use analyze --partial to get early insights. This is especially valuable for expensive experiments — if one factor clearly dominates after half the runs, you can make informed decisions about how to proceed.
  4. After completion: Run the full analysis, optimization, and report pipeline.
Complete manual workflow
# Print a worksheet for the lab
$ doe export-worksheet --config config.json --format csv --output runs.csv

# After each experiment, record the result
$ doe record --config config.json --run 1
$ doe record --config config.json --run 2

# Check progress
$ doe status --config config.json

# Peek at partial results
$ doe analyze --config config.json --partial

# When all runs are done
$ doe analyze --config config.json
$ doe optimize --config config.json
$ doe report --config config.json --output report.html

Partial Analysis Caveat

Partial results give directional insights but not definitive conclusions. Effect estimates from incomplete designs may be biased, especially if the missing runs are not randomly distributed. Always confirm findings with the complete dataset.

Putting It All Together: The Analysis Workflow

Here is the recommended sequence for analyzing any DOE result:

  1. Check the Pareto chart. Which factors are in the vital few? Are there any surprises (unexpected interactions)?
  2. Examine main effects plots. What direction does each important factor push the response? How large are the effects in practical units?
  3. Study interaction plots. Do the important factors interact? If lines cross, the main effect averages are misleading — optimize the combination, not the individual factors.
  4. Verify with confidence intervals. Are the important effects statistically significant? Are the unimportant ones truly negligible?
  5. Check residual plots. Is the model adequate? Are the assumptions met? If not, transform the data or upgrade the design.
  6. Run confirmation experiments. Does reality match the model's prediction at the recommended settings?

The Most Common Mistake in DOE Analysis

Looking only at main effects and ignoring interactions. In our server example, if you treated the factors as independent, you'd set both A and B high and expect the individual gains to add: 160 + 320 = 480 req/s above the 500 req/s baseline, predicting 980. The actual result is 1180, an extra 200 req/s from the interaction. More importantly, if you had missed the interaction and changed only one factor, you'd capture far less improvement than possible. Always check the interaction plots.


Chapter 8

Response Surface Methodology

RSM moves beyond "which factors matter" to "what is the optimal setting?" by fitting a mathematical model to the experimental data.

From Effects to Models

In Chapters 3–7 you learned to estimate effects — the change in response when a factor moves from low to high. Effects tell you the direction of improvement. But they don't tell you what happens at intermediate settings, or exactly where the optimum lies. That's where RSM comes in.

RSM fits a mathematical model to the experimental data, turning your discrete measurements into a continuous prediction equation. For two factors the full quadratic model is:

ŷ = β₀ + β₁x₁ + β₂x₂ + β₁₂x₁x₂ + β₁₁x₁² + β₂₂x₂²

Each β term captures a specific type of behaviour:

Term      What It Models                                         Example
β₀        Overall average response at the center of the design   88.5% yield at 175°C, 3% catalyst
β₁x₁      Linear effect of temperature                           Higher temperature increases yield by 6.2 per coded unit
β₂x₂      Linear effect of catalyst                              More catalyst increases yield by 8.1 per coded unit
β₁₂x₁x₂   Interaction — synergy or antagonism                    Temperature + catalyst together boost yield by an extra 2.3
β₁₁x₁²    Curvature in the temperature direction                 Yield peaks and then drops at very high temp (β₁₁ < 0)
β₂₂x₂²    Curvature in the catalyst direction                    Yield peaks at moderate catalyst (β₂₂ < 0)
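Once the coefficients are estimated, the model is just a polynomial you can evaluate anywhere in the coded region. A sketch using the reactor coefficients fitted later in this chapter (ŷ = 87.6 + 6.2·x₁ + 8.1·x₂ + 2.3·x₁x₂ - 5.8·x₁² - 4.1·x₂²):

```python
def predict(x1, x2):
    """Full two-factor quadratic model in coded units (reactor example)."""
    return (87.6 + 6.2 * x1 + 8.1 * x2
            + 2.3 * x1 * x2
            - 5.8 * x1**2 - 4.1 * x2**2)

center = predict(0.0, 0.0)   # 87.6: the intercept is the prediction at the center
corner = predict(1.0, 1.0)   # linear gains are partly offset by the curvature terms
```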

📐 The Terrain Map Analogy

Think of RSM as building a topographic map from a handful of elevation measurements. The linear terms tell you which direction is uphill. The quadratic terms tell you whether the terrain curves — whether there's a peak, a valley, or a saddle. The interaction term tells you whether the slope in one direction changes as you move in the other direction. Once you have the map, you can find the peak without visiting every single point.

First-Order vs. Second-Order Models

Not every RSM study needs quadratic terms. The two model orders serve different purposes:

First-Order (Linear + Interactions)

ŷ = β₀ + Σβᵢxᵢ + Σβᵢⱼxᵢxⱼ

Use when: You're far from the optimum and need to find the direction of steepest ascent. Requires only factorial + center points.

Second-Order (Full Quadratic)

ŷ = β₀ + Σβᵢxᵢ + Σβᵢⱼxᵢxⱼ + Σβᵢᵢxᵢ²

Use when: You're near the optimum and need to map the peak precisely. Requires CCD or Box-Behnken.

Worked Example: Reactor Yield Optimization

A chemical process team has screened 7 factors down to 2: temperature (x1, 150–200°C) and catalyst concentration (x2, 1–5%). They run a CCD with 5 center replicates (13 runs total). After fitting the full quadratic model, the optimize command reports:

RSM Output
# Fitted Model (coded units):
ŷ = 87.6 + 6.2·x₁ + 8.1·x₂ + 2.3·x₁x₂ - 5.8·x₁² - 4.1·x₂²

# Model Fit:
R² = 0.964
R²(adj) = 0.938
Lack-of-fit p-value = 0.31 (not significant → model is adequate)

# Stationary Point (coded):
x₁ = +0.48, x₂ = +0.35

# Stationary Point (natural units):
Temperature = 187°C, Catalyst = 3.7%

# Predicted optimum:
ŷ = 92.1% yield

Let's walk through how to interpret every part of this output.

Reading the Coefficients

Coefficient   Term                      Interpretation
β₀ = 87.6     Intercept                 At the center point (175°C, 3%), predicted yield is 87.6%
β₁ = +6.2     Temperature (linear)      Moving from center to high temperature adds 6.2% yield
β₂ = +8.1     Catalyst (linear)         Catalyst has a larger linear effect than temperature
β₁₂ = +2.3    Interaction               Mild synergy — both high together gives a bonus 2.3%
β₁₁ = -5.8    Temperature (quadratic)   Negative → yield curves downward at extreme temperatures. There's a peak.
β₂₂ = -4.1    Catalyst (quadratic)      Negative → yield also curves downward with extreme catalyst. There's a peak.

Both quadratic coefficients are negative, which means the response surface is dome-shaped — there's a true maximum inside the experimental region. This is the ideal outcome for RSM.

Model Adequacy

Before trusting any prediction, you must verify that the model actually fits the data. The tool reports several diagnostics automatically:

  • R² and Adjusted R² — Overall model fit.
  • PRESS statistic and Predicted R² — Leave-one-out cross-validation measure that penalizes influential observations. A large gap between R² and Predicted R² indicates overfitting.
  • ANOVA table — F-tests and p-values for each term, lack-of-fit test, and Lenth's PSE for unreplicated designs.
  • Diagnostic plots — Residuals vs. fitted values (check for patterns), normal probability plot of residuals (check for normality), residuals vs. run order (check for drift), and predicted vs. actual (check for bias). These are generated automatically by doe analyze.

R² and Adjusted R²

R²          Interpretation                                    Action
> 0.9       Excellent fit — model explains most variation     Proceed with optimization
0.7 – 0.9   Good fit — useful for optimization                Proceed, but widen confirmation runs
0.5 – 0.7   Moderate — model is missing something             Check for missing quadratic terms or transform the response
< 0.5       Poor fit — the model doesn't describe the data    Redesign the experiment: wider range, more factors, different model

In our reactor example, R² = 0.964 means the model explains 96.4% of the variation in yield. Adjusted R² = 0.938 penalizes for the number of terms — still high, confirming that the terms in the model are justified, not just overfitting noise.

R² vs. Adjusted R²: When to Worry

If R² = 0.95 but adjusted R² = 0.72, you have too many terms for the amount of data. Remove non-significant terms (usually the interaction first, then the smaller quadratic) and refit. A gap greater than ~0.10 between R² and adjusted R² is a red flag for overfitting.
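The adjustment is a one-line formula: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of runs and p the number of model terms excluding the intercept. A sketch for the reactor example (13 CCD runs, 5 terms):

```python
def adjusted_r2(r2, n, p):
    """Penalize R² for model size: n runs, p terms excluding the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj = adjusted_r2(r2=0.964, n=13, p=5)
# adj is about 0.938, matching the reported R²(adj) for the reactor model
```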

ANOVA Table

The analysis of variance breaks the model into individual terms and tests each one:

Source              Sum of Squares   df   F-value   p-value   Significant?
x₁ (Temp, linear)   307.4            1    38.2      0.0003    Yes
x₂ (Cat, linear)    524.9            1    65.3      <0.0001   Yes
x₁x₂                21.2             1    2.6       0.15      No
x₁²                 198.3            1    24.7      0.0011    Yes
x₂²                 99.1             1    12.3      0.0079    Yes
Lack of fit         15.4             3    1.5       0.31      Not significant (good)
Pure error          13.6             4

How to Read the ANOVA Table

Significant terms (p < 0.05): Both linear effects and both quadratic effects are significant — these terms belong in the model.

Non-significant terms: The interaction x₁x₂ (p = 0.15) is borderline. You could remove it to simplify the model, but keeping a mild interaction term usually doesn't hurt prediction accuracy.

Lack of fit (p = 0.31): This is not significant, which is what you want. It means the quadratic model fits the data adequately — there's no evidence of a pattern that the model misses. If this were significant (p < 0.05), you'd need a more complex model.

Finding the Optimum

The optimize command reports three things: the best observed run (the highest actual measurement), the predicted optimum at observed points from the RSM model, and the true surface optimum found by numerically optimizing the fitted polynomial using L-BFGS-B with multiple random restarts.

Best Observed Run

Temperature: 175°C

Catalyst: 3%

Yield: 89.0%

This is a real measurement from a center-point replicate. It's trustworthy but might not be the true optimum — you only tested discrete points.

Predicted Optimum (RSM)

Temperature: 187°C

Catalyst: 3.7%

Yield: 92.1% ± 2.4

This is the model's prediction at an untested point. It's the peak of the fitted surface. Must be validated with confirmation runs.

The predicted optimum is 3.1% better than the best observed run, and it occurs at a setting nobody tested directly. This is the power of RSM — interpolation between experimental points to find a better answer than any individual run.

Method of Steepest Ascent

If your first-order model suggests the optimum is far outside your experimental region (the contour lines are nearly parallel, pointing off the edge of the plot), use the method of steepest ascent before committing to a full CCD.

  1. Fit a first-order (linear) model to your factorial data.
  2. Compute the gradient direction: the path of steepest increase in the predicted response.
  3. Run a few experiments along that path, stepping outward from the center, until the response stops improving.
  4. Center a new CCD around the best point found on the path.

The DOE Helper Tool automates steps 1–2 with the --steepest flag:

doe optimize --config config.json --steepest

This generates a table of recommended follow-up experiment points along the gradient direction, with predicted response values at each step.

⛰ Hill-Climbing in Fog

You're on a mountainside in fog. You can't see the summit, but you can feel the slope under your feet. Steepest ascent says: walk uphill in the steepest direction. When the ground levels off (response stops improving), stop and survey the area carefully (run a CCD). This two-phase approach avoids wasting a full CCD in a region far from the peak.

Contour Plots and Surface Plots

The fitted RSM model can be visualized as a contour plot (2D topographic map) or a surface plot (3D wireframe). Contour plots are more practical for decision-making because you can read exact coordinates from them.

Here is the contour plot for our reactor example, showing predicted yield as a function of temperature and catalyst concentration:

Contour Plot — Predicted Yield (%) vs. Temperature × Catalyst
[Figure: concentric elliptical contours at 70%, 78%, 84%, 88%, and 91% over temperature 150–200°C and catalyst 1–5%, with factorial points, star points, and center replicates marked. The predicted optimum of 92.1% is flagged at 187°C, 3.7% catalyst.]

How to Read This Contour Plot

Each contour line connects points with the same predicted yield. The labels (70%, 78%, 84%, 88%, 91%) show the yield at each contour level. Moving inward toward the center of the concentric ellipses means higher yield.

The red dot marks the predicted optimum (92.1% at 187°C, 3.7% catalyst). It lies inside the innermost contour, slightly above and to the right of the design center.

The ellipse tilt reveals the interaction. The contours are tilted about 12° from horizontal, meaning temperature and catalyst have a mild synergy (the AB interaction). If there were no interaction, the ellipses would be aligned with the axes.

Contour spacing tells you sensitivity. Contours are more tightly packed in the temperature direction (left-right), meaning yield is more sensitive to temperature near the edges of the region. Along the catalyst axis (up-down), contours are more spread out, meaning yield changes more gradually.

Here's what different contour patterns mean and how to act on them:

Concentric Ellipses: Clear Peak
Optimum is inside the region. Run confirmation at the predicted peak.
Parallel Lines: Ridge System
Only one factor matters here. The other has no curvature. Set the flat factor to any convenient value.
Saddle Point: No True Optimum
Eigenvalues have mixed signs. The best point is on the boundary. Use ridge analysis.
Open Contours: Optimum Outside
The optimum is beyond your tested range. Use steepest ascent to move toward it, then re-center a new CCD.

Canonical Analysis

Canonical analysis transforms the fitted model into a simpler form that reveals the nature of the response surface at its stationary point. The quadratic RSM model can be written in matrix form as:

ŷ = ŷₛ + λ₁w₁² + λ₂w₂²

where ŷₛ is the predicted response at the stationary point, λᵢ are the eigenvalues of the quadratic coefficient matrix B, and wᵢ are the rotated coordinates. The original matrix form is:

ŷ = β₀ + x′b + x′Bx    ⇒    xₛ = -½ B⁻¹b

The eigenvalues tell you everything about the shape of the surface at the stationary point:

Eigenvalues      Surface Shape        What It Looks Like                                               Action
All negative     Maximum (hill-top)   A dome — response decreases in every direction from the peak     The stationary point is your optimum. Confirm it.
All positive     Minimum (valley)     A bowl — response increases in every direction from the low point   If minimizing, this is your optimum. If maximizing, the best point is on the boundary.
Mixed signs      Saddle point         A horse saddle — a peak in one direction but a valley in the other   No true optimum exists. Use ridge analysis to find the best point on the boundary.
Some near zero   Ridge system         A mountain ridge — flat along one direction, curved in another   Multiple near-optimal points exist. Pick the one that's most practical or robust.
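For two factors the eigenvalues and the stationary point have closed forms, so the whole classification fits in a few lines. The b and B values below are hypothetical, chosen only to illustrate the bookkeeping (B holds the quadratic coefficients on its diagonal and half the interaction coefficient off-diagonal):

```python
from math import sqrt

# Hypothetical quadratic fit: b = linear coefficients, B = symmetric quadratic matrix
b = (2.0, 1.0)
B = ((-4.0, 0.5),
     (0.5, -2.0))

# Eigenvalues of a symmetric 2x2 matrix, via trace and determinant
tr = B[0][0] + B[1][1]
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
disc = sqrt(tr * tr - 4 * det)
eigs = ((tr + disc) / 2, (tr - disc) / 2)

# Stationary point: x_s = -1/2 * B^-1 * b  (2x2 inverse written out)
xs = (-0.5 * (B[1][1] * b[0] - B[0][1] * b[1]) / det,
      -0.5 * (B[0][0] * b[1] - B[1][0] * b[0]) / det)

if all(e < 0 for e in eigs):
    shape = "maximum"       # dome: both eigenvalues negative
elif all(e > 0 for e in eigs):
    shape = "minimum"       # bowl
else:
    shape = "saddle"        # mixed signs
```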

For our reactor example:

Canonical Analysis Output
# Eigenvalues of B matrix:
λ₁ = -6.13 (temperature direction, rotated)
λ₂ = -3.77 (catalyst direction, rotated)

# Both negative → surface is a MAXIMUM (dome)

# Predicted response at stationary point: 92.1%
# Stationary point in coded units: x₁ = +0.48, x₂ = +0.35
# Stationary point in natural units: Temp = 187°C, Cat = 3.7%

Both eigenvalues are negative, confirming a true maximum. The magnitude ratio |λ₁|/|λ₂| = 1.63 tells us the surface is about 63% more curved in the temperature direction than in the catalyst direction — meaning yield is more sensitive to temperature deviations from the optimum. This has practical implications: you should control temperature more tightly than catalyst concentration in production.

Eigenvalue Magnitudes Matter

The absolute magnitude of each eigenvalue tells you how sensitive the response is along that direction. Large |λ| means steep curvature — small deviations from the optimum cause big response changes. Small |λ| means a flat ridge — you have more tolerance. When setting process tolerances, allocate tighter control to factors aligned with the largest eigenvalue.

Ridge Analysis

In roughly 30–40% of real RSM studies, the stationary point falls outside the experimental region or is a saddle point. In either case, the stationary point is not directly useful. Ridge analysis solves this by finding the best predicted response at each distance from the center of the design.

Think of it as drawing concentric circles (or spheres) outward from the center and asking: "What's the highest yield I can achieve on each circle?"
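That description translates almost literally into code: maximize the fitted polynomial on circles of growing radius, which a coarse grid over angles approximates well. A sketch on a hypothetical saddle-shaped surface (the coefficients are invented for illustration, not the reactor fit):

```python
from math import cos, sin, pi

def y_hat(x1, x2):
    """Hypothetical saddle-shaped quadratic fit in coded units."""
    return 85.0 + 1.0 * x2 + 2.0 * x1**2 - 3.0 * x2**2

def ridge_max(r, steps=3600):
    """Best predicted response on the circle of radius r (coarse angular grid)."""
    return max(y_hat(r * cos(2 * pi * k / steps), r * sin(2 * pi * k / steps))
               for k in range(steps))

trace = [(r, ridge_max(r)) for r in (0.0, 0.25, 0.5, 0.75, 1.0)]
# For this saddle the trace keeps climbing all the way to radius 1.0:
# the best achievable point lies on the boundary of the region
```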

🎯 The Fenced Garden Analogy

If the highest point on your property is actually outside the fence, ridge analysis finds the highest point on the fence — the best you can achieve within your constraints. You can then decide whether to extend the fence (expand the experimental region) in a follow-up study.

A ridge analysis trace for a saddle-point example might look like:

Radius (coded)   x₁      x₂      Predicted Response
0.0              0.00    0.00    85.2
0.2              -0.08   +0.18   86.1
0.4              -0.16   +0.36   87.8
0.6              -0.22   +0.56   89.9
0.8              -0.27   +0.75   91.5
1.0              -0.30   +0.95   92.4
1.2              -0.32   +1.14   92.8 (outside region)

Don't Extrapolate Beyond the Design

The model is fitted to data within the experimental region (coded -1 to +1). Predictions at radius > 1.0 are extrapolations — the model has no data there and may be wildly wrong. If the ridge trace shows the response still climbing at radius = 1.0, that's a signal to run a follow-up experiment centered further along the ridge direction.

The Method of Steepest Ascent

When you're far from the optimum, steepest ascent is the most efficient way to move toward it. The idea is simple: use the first-order model's coefficients to determine the direction of fastest improvement, then run a few experiments along that path.

From a first-order model with coefficients β₁ = 6.2 (temperature) and β₂ = 8.1 (catalyst), the steepest ascent path moves in the direction proportional to these coefficients. In natural units, you step along:

Step       Temp (°C)   Catalyst (%)   Observed Yield (%)
Center     175         3.0            82.4
+1 step    183         3.6            86.1
+2 steps   191         4.2            88.7
+3 steps   199         4.8            89.3 (best)
+4 steps   207         5.4            87.0 (declining)

Yield peaked at step 3 and declined at step 4 — you've passed the optimum. Center a new CCD around (199°C, 4.8%) to map the peak precisely.
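Generating the path is mechanical: step in proportion to the first-order coefficients, then convert coded steps back to natural units. A sketch under an assumed step size (here the larger coefficient moves 0.3 coded units per step, a choice made for illustration; the table above uses its own step size):

```python
# First-order coefficients (coded units) and factor scaling for the reactor example
b1, b2 = 6.2, 8.1                     # temperature, catalyst
center = {"temp": 175.0, "cat": 3.0}  # natural-unit centers
half = {"temp": 25.0, "cat": 2.0}     # half-ranges: 150-200 degC, 1-5 %

# Direction of steepest ascent is proportional to (b1, b2). Scale it so the
# larger coefficient advances base_step coded units per step.
base_step = 0.3
scale = base_step / max(abs(b1), abs(b2))
dx1, dx2 = b1 * scale, b2 * scale

path = []
for step in range(5):
    x1, x2 = step * dx1, step * dx2   # coded coordinates along the gradient
    path.append((round(center["temp"] + half["temp"] * x1, 1),
                 round(center["cat"] + half["cat"] * x2, 2)))
# path[0] is the center point; each later entry moves proportionally
# further along the gradient until the response stops improving
```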

When to Use Steepest Ascent vs. Direct CCD

Use steepest ascent when: Your initial factorial or screening study shows mostly linear effects with little curvature (center points close to factorial average). You're probably far from the peak and need to move toward it.

Skip to CCD directly when: Center points are noticeably higher (or lower) than the factorial average — this signals curvature, meaning you're already near the peak. A CCD will map it directly.

Lack-of-Fit Testing

If you have replicated center points (and you should — always include 3–5), you can formally test whether the quadratic model fits the data. The lack-of-fit test compares two sources of variation:

  • Pure error: Variation among replicates at the same settings (estimated from center points). This is "real" noise — the inherent variability of the process.
  • Lack of fit: Variation between model predictions and observed means at each unique design point. This is "systematic" error — patterns the model fails to capture.

The F-test asks: "Is the model's prediction error significantly larger than the pure experimental noise?"
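The F statistic itself is just the ratio of the two mean squares. A sketch using the sums of squares from the ANOVA table above (the comparison value is the F critical point for (3, 4) degrees of freedom at α = 0.05, roughly 6.59; computing the p-value needs an F table or a stats library):

```python
ss_lof, df_lof = 15.4, 3   # lack of fit: 3 degrees of freedom
ss_pe, df_pe = 13.6, 4     # pure error from replicated center points

f_stat = (ss_lof / df_lof) / (ss_pe / df_pe)
# f_stat is about 1.51, well below the ~6.59 critical value,
# so lack of fit is not significant: the quadratic model is adequate
```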

Lack-of-Fit NOT Significant (p > 0.05)

The model fits the data adequately. Prediction errors are no larger than random noise. Proceed with optimization using this model.

Our reactor example: p = 0.31. The quadratic model is sufficient.

Lack-of-Fit IS Significant (p < 0.05)

The model is missing something systematic. Predictions deviate from reality in a patterned way. Do not trust the predicted optimum.

Remedies: Transform Y (log, sqrt), add cubic terms, restrict the region, or try a different model.

Multi-Response Optimization with Desirability

Most real processes have multiple responses to optimize simultaneously. A reactor produces yield, purity, and cost. A server has throughput, latency, and CPU usage. The settings that maximize one response often worsen another.

The desirability function approach converts each response to a 0–1 scale (0 = unacceptable, 1 = ideal), then combines them into a single score:

D = (d₁^w₁ × d₂^w₂ × … × d_k^w_k)^(1/Σwᵢ)

where dᵢ is the individual desirability for response i, and wᵢ is its importance weight.

Desirability Functions by Goal

Goal       dᵢ = 0 when               dᵢ = 1 when         Example
Maximize   Response ≤ lower bound    Response ≥ target   Yield: d = 0 below 70%, d = 1 above 95%
Minimize   Response ≥ upper bound    Response ≤ target   Cost: d = 0 above $50, d = 1 below $20
Target     Response outside bounds   Response = target   pH: d = 1 at exactly 7.0, d = 0 outside 6.0–8.0

Continuing the reactor example with three responses:

Response      Goal       Lower   Target/Upper   Weight
Yield (%)     Maximize   70      95             3 (high priority)
Purity (%)    Maximize   90      99             2
Cost ($/kg)   Minimize   15      40             1 (low priority)
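The piecewise desirability functions and the weighted geometric mean are only a few lines of code. A sketch of the Derringer-Suich calculation with linear ramps, using the bounds and weights from the table above (the sample measurements are hypothetical):

```python
def d_maximize(y, lo, hi):
    """0 at or below lo, 1 at or above hi, linear ramp in between."""
    return min(1.0, max(0.0, (y - lo) / (hi - lo)))

def d_minimize(y, lo, hi):
    """1 at or below lo, 0 at or above hi, linear ramp in between."""
    return min(1.0, max(0.0, (hi - y) / (hi - lo)))

def overall_D(ds_and_weights):
    """Weighted geometric mean of individual desirabilities."""
    total_w = sum(w for _, w in ds_and_weights)
    product = 1.0
    for d, w in ds_and_weights:
        product *= d ** w
    return product ** (1.0 / total_w)

# Hypothetical measurements: yield 90%, purity 95%, cost $30/kg
D = overall_D([
    (d_maximize(90.0, 70, 95), 3),
    (d_maximize(95.0, 90, 99), 2),
    (d_minimize(30.0, 15, 40), 1),
])
# D is about 0.63: an acceptable compromise, with some trade-off remaining
```

Note that because the combination is a geometric mean, any single dᵢ = 0 drives the whole score to zero.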

Interpreting the Overall Desirability

D = 1.0: All responses have hit their ideal targets simultaneously — rare in practice.
D > 0.8: Excellent compromise. All responses are close to their targets.
D = 0.5–0.8: Acceptable compromise. Some responses are trading off against others.
D < 0.5: At least one response is far from its target. Revisit the weights or accept that the goals conflict strongly.
D = 0: At least one response is completely unacceptable. The geometric mean makes D = 0 if any individual di = 0 — this is a safety feature that prevents accepting solutions where any single response is terrible.

Using Multi-Objective Optimization in the DOE Helper Tool

The --multi flag on doe optimize implements Derringer-Suich desirability automatically. It fits RSM models for each response, evaluates desirability across the factor space, and reports the best compromise.

Multi-objective optimization
# Basic multi-objective (equal weights, auto bounds)
$ doe optimize --config config.json --multi

To control the trade-off, add weight and bounds to responses in your config:

config.json — weighted responses
"responses": [
  {"name": "yield", "optimize": "maximize", "weight": 3, "bounds": [70, 95]},
  {"name": "purity", "optimize": "maximize", "weight": 2, "bounds": [90, 99]},
  {"name": "cost", "optimize": "minimize", "weight": 1, "bounds": [15, 40]}
]
  • weight (default 1.0) — relative importance. A weight of 3 means "this response matters 3× more" in the geometric mean.
  • bounds (optional) — the two numbers that anchor desirability scaling, written as [lower, upper]. For "maximize," the lower value is the worst acceptable (d = 0) and the upper is the ideal (d = 1). For "minimize," the lower value is the ideal and the upper is the worst acceptable. If omitted, bounds are auto-computed from the observed data.

The output includes:

  • Overall desirability D — the weighted geometric mean across all responses
  • Per-response desirability — how well each response meets its target
  • Recommended settings — either from observed runs or RSM-predicted optima
  • Trade-off summary — how much each response sacrifices compared to its individual best
  • Top 3 runs — the best observed factor combinations ranked by D

Choosing Weights Carefully

Weights express relative priority, not absolute importance. If you're unsure, start with equal weights (all 1.0) and examine the trade-offs. Then adjust: if the compromise sacrifices too much yield, increase yield's weight and re-run. Iterate until the trade-off is acceptable. Document your final weights — they encode a business decision about what matters most.

The Complete RSM Workflow

Bringing together everything in this chapter, here's the recommended sequence:

  1. Start with a first-order model. Fit a factorial or PB design. Check center points for curvature. If center-point mean ≈ factorial mean, the surface is flat — use steepest ascent to move toward the optimum.
  2. Run steepest ascent. Take 3–6 steps along the gradient. Stop when the response starts declining or leveling off.
  3. Center a CCD around the best point. Run a CCD (or Box-Behnken if corners are infeasible). Fit the full quadratic model.
  4. Check model adequacy. R² > 0.9? Lack of fit not significant? Residuals look random? If yes, proceed. If no, diagnose and fix.
  5. Interpret the contour plot. Concentric ellipses with the optimum inside? Great. Saddle or open contours? Use ridge analysis or expand the region.
  6. Run canonical analysis. Confirm the stationary point type. Check eigenvalue magnitudes for sensitivity information.
  7. For multiple responses, compute desirability. Set targets and weights in your config, then run doe optimize --config config.json --multi to find the best compromise.
  8. Confirm. Run 3–5 experiments at the predicted optimum. Compare to the prediction interval.
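Step 2 has a simple mechanical core worth seeing once: in coded units, each factor moves in proportion to its first-order coefficient. The sketch below is illustrative Python, not the tool's implementation (doe optimize automates this); the factor names, coefficients, and half-ranges are invented for the example.

```python
def steepest_ascent_step(coeffs, base_factor, base_step_coded):
    """Coded-unit step along the gradient, scaled so the chosen
    base factor moves by base_step_coded per step."""
    scale = base_step_coded / coeffs[base_factor]
    return {f: b * scale for f, b in coeffs.items()}

def to_natural(step_coded, half_ranges):
    """Convert coded steps to natural units via each factor's half-range."""
    return {f: s * half_ranges[f] for f, s in step_coded.items()}

# First-order coefficients from a fitted factorial model (coded units)
coeffs = {"temperature": 2.0, "time": 1.0}
step = steepest_ascent_step(coeffs, "temperature", 1.0)
# → temperature moves 1.0 coded unit per step, time moves 0.5
natural = to_natural(step, {"temperature": 10.0, "time": 5.0})
# → +10 units of temperature and +2.5 units of time per step
```

Each step along this path corresponds to one run in stage 2 of the workflow; you stop stepping when the response stops improving.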

RSM Is Not Machine Learning

RSM fits simple polynomial models (usually 5–15 coefficients) to small datasets (13–50 runs). It works because the experimental region is small and the physics is smooth. Don't expect it to capture complex, highly nonlinear relationships across a vast design space — that's a different problem requiring different tools. RSM's strength is precise optimization in a local region with statistical guarantees, not global exploration.


Chapter 9

Practical Troubleshooting

Common issues and how to resolve them.

"All my effects look the same size"

Likely cause: Factor ranges are too narrow, or the response is dominated by noise. Fix: Widen factor ranges (use more extreme levels) or add replicates to reduce noise.

"The confidence intervals are huge"

Likely cause: Too few replicates or high experimental noise. Fix: Increase block_count to add replicates, or reduce measurement variability.

"My optimizer says linear R² is low"

Likely cause: The true relationship is nonlinear (quadratic curvature). Fix: Switch from a screening design to CCD or Box-Behnken, which can fit quadratic terms.

"Some runs failed"

The generated runner scripts handle failures gracefully — they track failed runs and continue. After fixing the issue, re-run only the missing runs (the skip logic detects existing result files). Use --partial with analyze, optimize, or report to work with incomplete data:

Terminal
$ doe analyze --config config.json --partial
# Partial mode: analyzing 5/8 completed runs (missing: 3, 6, 7)

"I need more than 2 levels for screening"

Plackett-Burman and fractional factorial designs require exactly 2 levels per factor. For multi-level screening, use a full factorial with a manageable subset of factors, or use a Latin hypercube design for continuous factors.
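The core idea of a Latin hypercube is that each factor's range is divided into as many strata as there are runs, and every stratum is sampled exactly once. The tool's latin_hypercube operation handles this for you; the standalone Python sketch below is only to build intuition, with an invented function name and seed.

```python
import random

def latin_hypercube(n_runs, n_factors, seed=42):
    """Unit-scale Latin hypercube: for each factor, shuffle the strata
    0..n_runs-1, then draw uniformly within each chosen stratum."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_factors):
        strata = list(range(n_runs))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_runs for s in strata])
    # Transpose: one row per run; values in [0, 1) get scaled to real ranges
    return [list(row) for row in zip(*columns)]

plan = latin_hypercube(n_runs=5, n_factors=3)
# Each column visits the strata [0, 0.2), [0.2, 0.4), ... exactly once
```

Scaling each column from [0, 1) to a factor's actual low-high range gives a space-filling plan with any number of effective levels.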

"My model predicts values outside the physical range"

Likely cause: A linear or quadratic model extrapolating beyond the experimental region, or the response has a natural bound (e.g., yield can't exceed 100%). Fix: Apply a transformation (logit for proportions, log for positive-only values) before fitting the model, or constrain predictions to the feasible range during optimization.
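For a bounded response like yield, the logit transform mentioned above maps the open interval (0, 1) onto the whole real line, so a polynomial fitted on the transformed scale can never back-transform outside the physical range. A minimal sketch (illustrative numbers; the fitting step itself is omitted):

```python
import math

def logit(p):
    """Map a proportion strictly inside (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Back-transform; the result always lies strictly inside (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Fit the model on logit(yield/100); back-transform its predictions.
# Even an aggressive extrapolation on the logit scale stays below 100%:
z_pred = logit(0.97) + 0.8            # hypothetical extrapolated prediction
yield_pred = 100 * inv_logit(z_pred)  # bounded in (0, 100) by construction
```

The same pattern applies with log for positive-only responses: fit on log(Y), then exponentiate predictions.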

"I have categorical and continuous factors mixed together"

This is common (e.g., algorithm type + buffer size). Use a full or fractional factorial design where categorical factors appear as coded levels. The tool handles mixed factor types — just set type: "categorical" or type: "continuous" in the config. Note: CCD and Box-Behnken require continuous factors, so you may need to run separate RSM studies for each level of the categorical factor.

"I don't have a test script — my experiments are physical/manual"

The tool fully supports manual experiments. Leave test_script empty in your config and use the manual workflow:

  1. Export a worksheet: doe export-worksheet --config config.json --format csv --output worksheet.csv
  2. Check progress: doe status --config config.json
  3. Record results: doe record --config config.json --run 1
  4. Analyze incrementally: doe analyze --config config.json --partial

The analysis pipeline doesn't care how results were produced — only that run_N.json files exist with the right response keys.

"My results aren't reproducible between days"

Likely cause: Day-to-day environmental variation (ambient temperature, system load, different hardware). Fix: Use blocking — set block_count: 2 (or more) so each day is a block. Randomize run order within each block. The analysis separates block effects from treatment effects.
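The arithmetic behind blocking is worth seeing once. In this illustrative Python sketch (not the tool's code, and with invented numbers), day 2's environment shifts every measurement up by 5 units; a naive across-day comparison would be biased, but the within-block treatment effect is unaffected:

```python
from statistics import mean

# Two blocks (days); each day runs the factor at its low and high level.
runs = {
    "day1": {"low": 10.0, "high": 14.0},
    "day2": {"low": 15.0, "high": 19.0},  # everything shifted +5 by the environment
}

# Treatment effect estimated within each block, then averaged across blocks:
effect = mean(day["high"] - day["low"] for day in runs.values())          # 4.0

# Block effect: difference between block means (recovers the day-2 shift):
block_effect = mean(runs["day2"].values()) - mean(runs["day1"].values())  # 5.0
```

Because each block contains every treatment, the day-to-day shift cancels out of the effect estimate, which is exactly what the tool's block-aware analysis does at scale.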

Common Mistakes Checklist

Before running your experiment, verify you haven't made these common errors:

Mistake                                    | Consequence                                      | Prevention
Running factors one at a time              | Missing interactions; biased effect estimates    | Use a proper design (factorial, PB, etc.)
Not randomizing run order                  | Time trends confounded with factor effects       | Always randomize; the tool does this by default
Factor ranges too narrow                   | All effects look insignificant                   | Use bold ranges (at least ±25% of current value)
Factor ranges too wide                     | Nonlinear behavior; model breaks down at extremes | Check feasibility of extreme combinations first
No replicates                              | No error estimate; can't assess significance     | Add center points or use block_count > 1
Changing unmeasured variables between runs | Confounded effects                               | Document and control all known nuisance variables
Ignoring the response metric definition    | Measuring the wrong thing                        | Define the response precisely before running (mean? median? p99?)

The DOE Workflow in Practice

Here is the complete workflow, from initial idea to validated optimum:

  1. Define the objective. What are you optimizing? What constitutes success? Agree on quantifiable responses.
  2. List candidate factors. Brainstorm with domain experts. Include factors you're not sure about — screening will sort them out.
  3. Set factor levels. Choose bold but feasible ranges. The levels should be far enough apart that real effects are detectable.
  4. Choose the design. Many factors? Screen first (PB). Few factors? Go straight to factorial or CCD.
  5. Randomize and run. Follow the generated run order. For automated experiments, use the generated runner script. For manual experiments, use export-worksheet to print a run sheet, status to track progress, and record to enter results. Record everything, including conditions you didn't plan to vary.
  6. Analyze. Use --partial to get early insights while experiments are still running. Compute effects, check residuals, identify the vital few factors.
  7. Follow up. If screening, run a factorial or RSM on the important factors. If optimizing, run confirmation experiments.
  8. Confirm and implement. Validate the optimum with independent runs. Then deploy.

Iteration Is Normal

DOE is rarely a one-shot process. Most real optimization studies require 2–3 rounds: screening, then characterization, then fine-tuning. Each round informs the next. The total cost is still far less than exhaustive search, and the conclusions are statistically defensible.


User Guide

DOE Helper Tool — Software Reference

This section is the complete software reference for the DOE Helper Tool. It covers installation, configuration, every CLI command, and step-by-step tutorials to put the theory from Chapters 1–9 into practice.

Installation & Setup

The DOE Helper Tool is a Python package that works on any system with Python 3.10 or later.

pip install doehelper

Verify the installation:

doe info

This prints the installed version, available design types, and default configuration values. For full installation details — including virtual-environment setup, shell completion, and upgrading — see the Installation & Setup reference.

Configuration Reference

Every experiment starts with a JSON configuration file that describes your factors, responses, and design choices. Here is a minimal example:

{
  "factors": [
    {"name": "temperature", "levels": ["60", "80"], "type": "continuous"},
    {"name": "pressure", "levels": ["1.0", "2.0"], "type": "continuous"}
  ],
  "responses": [
    {"name": "yield", "optimize": "maximize"}
  ],
  "settings": {
    "operation": "full_factorial"
  }
}

The tool supports 11 design types: full_factorial, fractional_factorial, plackett_burman, latin_hypercube, central_composite, box_behnken, definitive_screening, taguchi, d_optimal, mixture_simplex_lattice, and mixture_simplex_centroid.

The configuration supports many more fields — blocking, center points, custom factor levels, optimization targets, response weights for multi-objective optimization, and more. The complete field-by-field reference is in the Configuration Reference.

CLI Commands

The tool provides a set of CLI commands that map to each phase of the DOE workflow described in Chapter 9:

Command              | Purpose                                                                          | Workflow Phase
doe generate         | Create a design matrix and runner script                                         | Plan & execute
doe analyze          | Compute effects, ANOVA, diagnostics, and all plots                               | Analyze
doe optimize         | Find optimal factor settings (surface optimization, steepest ascent, multi-objective) | Optimize
doe power            | Compute statistical power for each factor                                        | Plan
doe augment          | Extend a design with fold-over, star points, or center points                    | Plan & execute
doe report           | Generate an HTML report with all charts, ANOVA tables, and diagnostics           | Report
doe info             | Show design summary, alias structure, and D/A/G-efficiency metrics               | Setup
doe record           | Manually enter a result for a specific run                                       | Execute (manual)
doe status           | Show progress of an in-flight experiment                                         | Execute
doe export-worksheet | Export a printable run sheet for manual experiments                              | Execute (manual)
Every command accepts --help for a full list of options. For detailed usage, examples, and flags for each command, see the CLI Command Reference.

Tutorials

The best way to get comfortable with the tool is to work through a hands-on example. The Tutorials section walks you through complete experiments from configuration to final report, tying together the theory from this book with the software commands above.

For a broader collection of real-world scenarios — from web-server tuning to machine-learning hyperparameter search — explore the Use Cases library.

Quick Reference

The full Quick Start & Reference page is designed to work as a standalone reference you can keep open while running experiments. Everything in this User Guide section links there for the complete details.