Summary
This experiment investigates ML hyperparameter screening: a fractional factorial design screens 5 hyperparameters with minimal runs.
The design varies 5 factors: learning rate (0.001 to 0.1), batch size (32 to 256), dropout (0.1 to 0.5), hidden layers (2 to 6), and optimizer (sgd vs. adam). The goal is to optimize 2 responses: accuracy (%, maximize) and training time (sec, minimize). Fixed conditions held constant across all runs: epochs = 50, dataset = cifar10.
A fractional factorial design reduces the number of runs from 32 to 8 by deliberately confounding higher-order interactions. This is ideal for screening — identifying which of the 5 factors matter most before investing in a full study.
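The construction can be sketched in a few lines. This uses the textbook generators D = AB, E = AC for a 2^(5-2) fraction; the generators are an assumption for illustration, so the tool's actual fraction (and the matrix below) need not match this one.

```python
from itertools import product

# Full factorial in three base factors A, B, C (2^3 = 8 runs),
# with the remaining two factors generated as D = AB and E = AC.
design = [(a, b, c, a * b, a * c) for a, b, c in product((-1, 1), repeat=3)]

for run in design:
    print(run)  # coded levels: -1 = low, +1 = high
```

Mapping each coded column to a factor's low/high setting yields a concrete run sheet like the experimental matrix below.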
Key Findings
For accuracy, the most influential factors were optimizer (33.5%), hidden layers (31.0%), and learning rate (29.6%). The best observed value was 78.1% (run #4: learning rate = 0.1, batch size = 256, dropout = 0.5, hidden layers = 6, optimizer = adam).
For training time, the most influential factors were optimizer (33.6%), hidden layers (25.5%), and learning rate (24.7%). The best observed value was 77.3 sec, also at run #4.
Recommended Next Steps
- Follow up with a response surface design (CCD or Box-Behnken) on the top 3–4 factors to model curvature and find the true optimum.
- Consider whether any fixed factors should be varied in a future study.
- The screening results can guide factor reduction — drop factors contributing less than 5% and re-run with a smaller, more focused design.
The Scenario
You are training a deep learning model and need to screen 5 hyperparameters to find which ones matter most. A full factorial would require 2^5 = 32 runs, and each training run takes significant GPU time. A 2^(5-2) fractional factorial cuts this to 8 runs.
ℹ Why Fractional Factorial?
A Resolution III fractional factorial uses only 8 runs. You're screening: you just need to know which hyperparameters have the largest effects. Some main effects are aliased with 2-factor interactions, but follow-up experiments can resolve ambiguities.
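Aliasing is easy to see directly: under a set of generators such as D = AB, E = AC (one standard choice, not necessarily the tool's), the sign column of a main effect is identical to that of a two-factor interaction, so their effects cannot be separated. A minimal sketch:

```python
from itertools import product

# 2^(5-2) fraction built from generators D = AB, E = AC (coded -1/+1).
design = [(a, b, c, a * b, a * c) for a, b, c in product((-1, 1), repeat=3)]

A = [r[0] for r in design]
BD = [r[1] * r[3] for r in design]  # B x D interaction column
assert A == BD  # A's estimated effect absorbs the BD interaction
```

This is why a large "main effect" from a screening design should be confirmed with a follow-up run before acting on it.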
Experimental Setup
Factors
| Factor | Low | High | Type |
|---|---|---|---|
| learning_rate | 0.001 | 0.1 | continuous |
| batch_size | 32 | 256 | continuous |
| dropout | 0.1 | 0.5 | continuous |
| hidden_layers | 2 | 6 | continuous |
| optimizer | sgd | adam | categorical |
Fixed: epochs = 50, dataset = cifar10
Responses
| Response | Direction | Unit |
|---|---|---|
| accuracy | ↑ maximize | % |
| training_time | ↓ minimize | sec |
✔ Conflicting objectives
Look for factors that improve accuracy without hurting training time — those are the "free wins."
Experimental Matrix
The Fractional Factorial Design produces 8 runs. Each row is one experiment with specific factor settings.
| Run | learning_rate | batch_size | dropout | hidden_layers | optimizer |
|---|---|---|---|---|---|
| 1 | 0.001 | 256 | 0.5 | 2 | sgd |
| 2 | 0.1 | 32 | 0.1 | 2 | sgd |
| 3 | 0.1 | 256 | 0.1 | 6 | sgd |
| 4 | 0.1 | 256 | 0.5 | 6 | adam |
| 5 | 0.001 | 256 | 0.1 | 2 | adam |
| 6 | 0.1 | 32 | 0.5 | 2 | adam |
| 7 | 0.001 | 32 | 0.1 | 6 | adam |
| 8 | 0.001 | 32 | 0.5 | 6 | sgd |
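Each main effect reported in the analysis output is simply the mean response at a factor's high level minus the mean at its low level. A quick check using the learning-rate level means from the accuracy summary statistics in this report:

```python
# Level means for learning_rate (accuracy), from the summary statistics.
mean_low = 70.7525   # learning_rate = 0.001
mean_high = 63.8100  # learning_rate = 0.1

effect = mean_high - mean_low
print(round(effect, 4))  # -6.9425, matching the main-effects table
```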
Step-by-Step Workflow
$ doe info --config use_cases/03_ml_hyperparameter_screening/config.json
$ doe generate --config use_cases/03_ml_hyperparameter_screening/config.json \
--output results/run.sh --seed 7
$ bash results/run.sh
$ doe analyze --config use_cases/03_ml_hyperparameter_screening/config.json
$ doe optimize --config use_cases/03_ml_hyperparameter_screening/config.json
$ doe optimize --config use_cases/03_ml_hyperparameter_screening/config.json --multi
$ doe report --config use_cases/03_ml_hyperparameter_screening/config.json \
--output results/report.html
⚠ Positional argument style
This use case uses positional args: factors are passed in order without flag names. The runner script calls `sim.sh 0.001 32 0.1 2 sgd --out run_1.json`. Useful when your training script expects ordered arguments.
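A hypothetical sketch of how such a runner assembles the positional command line; the real script is generated by `doe generate`, and `run_one` and its argument order are illustrative only:

```python
def run_one(run_id, settings):
    """Build the positional arg list for one run: factor values in
    design order, then the output flag. Execute with subprocess.run(cmd)."""
    cmd = ["./sim.sh", *(str(v) for v in settings),
           "--out", f"run_{run_id}.json"]
    return cmd

print(run_one(1, [0.001, 32, 0.1, 2, "sgd"]))
# ['./sim.sh', '0.001', '32', '0.1', '2', 'sgd', '--out', 'run_1.json']
```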
Features Exercised
| Feature | Value |
|---|---|
| Design type | fractional_factorial |
| Factor types | continuous (4) + categorical (1) |
| Arg style | positional |
| Run reduction | 8 runs instead of 32 (75% savings) |
| Multi-response | accuracy ↑, training_time ↓ |
| --seed | 7 (reproducible run order) |
Analysis Results
Generated from actual experiment runs using the DOE Helper Tool.
Response: accuracy
The Pareto chart identifies which hyperparameters contribute most to model accuracy.
Pareto Chart
Main Effects Plot
Response: training_time
Training time is driven by a different set of hyperparameters, revealing trade-offs with accuracy.
Pareto Chart
Main Effects Plot
Response Surface Plots
3D surfaces fitted with quadratic RSM. Red dots are observed data points.
📊 How to Read These Surfaces
Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.
- Flat surface — these two factors have little effect on the response.
- Tilted plane — strong linear effect; moving along one axis consistently changes the response.
- Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
- Saddle shape — significant interaction; the best setting of one factor depends on the other.
- Red dots far from surface — poor model fit in that region; be cautious about predictions there.
accuracy (%) — R² = 1.000, Adj R² = 1.000
With only 8 runs, a quadratic model is saturated, so R² = 1.000 means the surface interpolates the data exactly rather than that the fit is validated; treat the surface shape as indicative, not confirmed.
Curvature detected in learning_rate, batch_size — look for a peak or valley in the surface.
Strongest linear driver: batch_size (decreases accuracy).
Notable interaction: learning_rate × hidden_layers — the effect of one depends on the level of the other. Look for a twisted surface.
training_time (sec) — R² = 1.000, Adj R² = 1.000
With only 8 runs, a quadratic model is saturated, so R² = 1.000 means the surface interpolates the data exactly rather than that the fit is validated; treat the surface shape as indicative, not confirmed.
Curvature detected in learning_rate, batch_size — look for a peak or valley in the surface.
Strongest linear driver: batch_size (increases training_time).
Notable interaction: learning_rate × hidden_layers — the effect of one depends on the level of the other. Look for a twisted surface.
accuracy: batch size vs dropout
accuracy: batch size vs hidden layers
accuracy: dropout vs hidden layers
accuracy: learning rate vs batch size
accuracy: learning rate vs dropout
accuracy: learning rate vs hidden layers
training time: batch size vs dropout
training time: batch size vs hidden layers
training time: dropout vs hidden layers
training time: learning rate vs batch size
training time: learning rate vs dropout
training time: learning rate vs hidden layers
Full Analysis Output
=== Main Effects: accuracy ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
optimizer -7.8675 3.0940 33.5%
hidden_layers 7.2775 3.0940 31.0%
learning_rate -6.9425 3.0940 29.6%
dropout 0.7725 3.0940 3.3%
batch_size -0.6125 3.0940 2.6%
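The % Contribution column appears to be each |effect| as a share of the sum of |effects|; recomputing it from the table above reproduces the reported values. This is a sketch of an inferred formula, since the tool's exact calculation isn't documented here:

```python
# Main effects on accuracy, copied from the table above.
effects = {
    "optimizer": -7.8675, "hidden_layers": 7.2775,
    "learning_rate": -6.9425, "dropout": 0.7725, "batch_size": -0.6125,
}
total = sum(abs(e) for e in effects.values())
contrib = {f: 100 * abs(e) / total for f, e in effects.items()}
print({f: round(c, 1) for f, c in contrib.items()})
# {'optimizer': 33.5, 'hidden_layers': 31.0, 'learning_rate': 29.6,
#  'dropout': 3.3, 'batch_size': 2.6}
```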
=== Interaction Effects: accuracy ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
batch_size optimizer 9.9875 18.3%
dropout hidden_layers 9.9875 18.3%
learning_rate dropout 7.8675 14.4%
learning_rate batch_size -7.2775 13.3%
batch_size hidden_layers 6.9425 12.7%
dropout optimizer 6.9425 12.7%
batch_size dropout -2.0625 3.8%
hidden_layers optimizer -2.0625 3.8%
learning_rate optimizer -0.7725 1.4%
learning_rate hidden_layers 0.6125 1.1%
=== Summary Statistics: accuracy ===
learning_rate:
Level N Mean Std Min Max
------------------------------------------------------------
0.001 4 70.7525 7.2472 59.9100 74.9300
0.1 4 63.8100 9.6972 57.4100 78.1000
batch_size:
Level N Mean Std Min Max
------------------------------------------------------------
256 4 67.5875 10.4395 57.4100 78.1000
32 4 66.9750 8.3340 58.1600 74.5000
dropout:
Level N Mean Std Min Max
------------------------------------------------------------
0.1 4 66.8950 8.7327 57.4100 74.9300
0.5 4 67.6675 10.1010 58.1600 78.1000
hidden_layers:
Level N Mean Std Min Max
------------------------------------------------------------
2 4 63.6425 7.6527 58.1600 74.9300
6 4 70.9200 9.2096 57.4100 78.1000
optimizer:
Level N Mean Std Min Max
------------------------------------------------------------
adam 4 71.2150 8.9006 58.1600 78.1000
sgd 4 63.3475 7.6291 57.4100 74.5000
=== Main Effects: training_time ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
optimizer 19.3500 7.2462 33.6%
hidden_layers -14.7000 7.2462 25.5%
learning_rate 14.2000 7.2462 24.7%
dropout -9.2500 7.2462 16.1%
batch_size -0.1000 7.2462 0.2%
=== Interaction Effects: training_time ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
batch_size optimizer -24.0500 18.9%
dropout hidden_layers -24.0500 18.9%
learning_rate dropout -19.3500 15.2%
learning_rate batch_size 14.7000 11.5%
batch_size hidden_layers -14.2000 11.1%
dropout optimizer -14.2000 11.1%
learning_rate optimizer 9.2500 7.3%
batch_size dropout 3.7500 2.9%
hidden_layers optimizer 3.7500 2.9%
learning_rate hidden_layers 0.1000 0.1%
=== Summary Statistics: training_time ===
learning_rate:
Level N Mean Std Min Max
------------------------------------------------------------
0.001 4 98.6000 15.6327 86.1000 121.2000
0.1 4 112.8000 24.5218 77.3000 133.7000
batch_size:
Level N Mean Std Min Max
------------------------------------------------------------
256 4 105.7500 26.1586 77.3000 133.7000
32 4 105.6500 17.2003 86.1000 120.5000
dropout:
Level N Mean Std Min Max
------------------------------------------------------------
0.1 4 110.3250 20.2307 90.8000 133.7000
0.5 4 101.0750 22.6672 77.3000 121.2000
hidden_layers:
Level N Mean Std Min Max
------------------------------------------------------------
2 4 113.0500 14.8460 90.8000 121.2000
6 4 98.3500 24.8126 77.3000 133.7000
optimizer:
Level N Mean Std Min Max
------------------------------------------------------------
adam 4 96.0250 17.6872 77.3000 119.7000
sgd 4 115.3750 20.4371 86.1000 133.7000
Optimization Recommendations
=== Optimization: accuracy ===
Direction: maximize
Best observed run: #4
learning_rate = 0.001
batch_size = 32
dropout = 0.5
hidden_layers = 6
optimizer = sgd
Value: 78.1
RSM Model (linear, R² = 0.76):
Coefficients:
intercept: +67.2812
learning_rate: -4.2637
batch_size: +0.4462
dropout: -3.0338
hidden_layers: +4.7863
optimizer: +0.4937
Predicted optimum:
learning_rate = 0.001
batch_size = 32
dropout = 0.1
hidden_layers = 6
optimizer = adam
Predicted value: 78.4250
Factor importance:
1. hidden_layers (effect: 9.6, contribution: 36.8%)
2. learning_rate (effect: -8.5, contribution: 32.7%)
3. dropout (effect: -6.1, contribution: 23.3%)
4. optimizer (effect: 1.0, contribution: 3.8%)
5. batch_size (effect: -0.9, contribution: 3.4%)
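The predicted optimum follows directly from the linear coefficients in coded units (-1 = low, +1 = high). Reproducing the reported 78.4250 requires assuming the optimizer is coded sgd = +1, adam = -1; that coding is an inference from the numbers, not documented tool behavior:

```python
# Linear RSM coefficients for accuracy, from the output above.
coef = {"learning_rate": -4.2637, "batch_size": 0.4462, "dropout": -3.0338,
        "hidden_layers": 4.7863, "optimizer": 0.4937}
intercept = 67.2812

# Predicted optimum: learning_rate low, batch_size low, dropout low,
# hidden_layers high, optimizer adam (coded -1 under the assumption above).
x = {"learning_rate": -1, "batch_size": -1, "dropout": -1,
     "hidden_layers": +1, "optimizer": -1}

pred = intercept + sum(coef[k] * x[k] for k in coef)
print(round(pred, 4))  # ~78.425, matching the reported predicted value
```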
=== Optimization: training_time ===
Direction: minimize
Best observed run: #4
learning_rate = 0.001
batch_size = 32
dropout = 0.5
hidden_layers = 6
optimizer = sgd
Value: 77.3
RSM Model (linear, R² = 0.73):
Coefficients:
intercept: +105.7000
learning_rate: +10.4750
batch_size: -1.0500
dropout: +7.4750
hidden_layers: -9.4750
optimizer: -3.4500
Predicted optimum:
learning_rate = 0.1
batch_size = 32
dropout = 0.5
hidden_layers = 2
optimizer = adam
Predicted value: 137.6250
Factor importance:
1. learning_rate (effect: 21.0, contribution: 32.8%)
2. hidden_layers (effect: -19.0, contribution: 29.7%)
3. dropout (effect: 15.0, contribution: 23.4%)
4. optimizer (effect: -6.9, contribution: 10.8%)
5. batch_size (effect: 2.1, contribution: 3.3%)
Multi-Objective Optimization
When responses compete, Derringer–Suich desirability finds the best compromise.
Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
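A minimal sketch of the mechanics, using simple linear one-sided ramps; the tool's actual anchor points and ramp shapes for each d(y) are not shown in this report:

```python
import math

def desirability(y, lo, hi, maximize=True):
    """Linear Derringer-Suich ramp scaling a response onto [0, 1]."""
    d = (y - lo) / (hi - lo) if maximize else (hi - y) / (hi - lo)
    return min(1.0, max(0.0, d))

def overall(ds, weights):
    """Weighted geometric mean: D = (prod d_i^w_i)^(1 / sum w_i)."""
    return math.prod(d ** w for d, w in zip(ds, weights)) ** (1 / sum(weights))

# Both responses at full desirability gives D = 1, as in this report.
print(overall([1.0, 1.0], [1.5, 1.0]))  # 1.0
```

Because the combination is a geometric mean, any single response with d = 0 drives D to 0, which is what makes desirability a "compromise" method rather than a weighted sum.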
Overall Desirability
D = 1.0000
Per-Response Desirability
| Response | Weight | Desirability | Predicted | Dir |
|---|---|---|---|---|
| accuracy | 1.5 | 1.0000 | 80.87 % | ↑ |
| training_time | 1.0 | 1.0000 | 73.29 sec | ↓ |
Recommended Settings
| Factor | Value |
|---|---|
| learning_rate | 0.09247 |
| batch_size | 34.55 |
| dropout | 0.4207 |
| hidden_layers | 5.91 |
| optimizer | adam |
Source: from RSM model prediction
Trade-off Summary
Sacrifice = how much worse the compromise is than the single-objective best observed (a negative value means the model predicts better than any observed run).
| Response | Predicted | Best Observed | Sacrifice |
|---|---|---|---|
| accuracy | 80.87 | 78.10 | -2.77 |
| training_time | 73.29 | 77.30 | -4.01 |
Top 3 Runs by Desirability
| Run | D | Factor Settings |
|---|---|---|
| #4 | 0.9545 | learning_rate=0.001, batch_size=32, dropout=0.1, hidden_layers=6, optimizer=adam |
| #3 | 0.8029 | learning_rate=0.1, batch_size=256, dropout=0.5, hidden_layers=6, optimizer=adam |
| #6 | 0.7830 | learning_rate=0.001, batch_size=32, dropout=0.5, hidden_layers=6, optimizer=sgd |
Model Quality
| Response | R² | Type |
|---|---|---|
| accuracy | 0.9745 | linear |
| training_time | 0.9277 | linear |
Full Multi-Objective Output
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================
Overall desirability: D = 1.0000
Response Weight Desirability Predicted Direction
---------------------------------------------------------------------
accuracy 1.5 1.0000 80.87 % ↑
training_time 1.0 1.0000 73.29 sec ↓
Recommended settings:
learning_rate = 0.09247
batch_size = 34.55
dropout = 0.4207
hidden_layers = 5.91
optimizer = adam
(from RSM model prediction)
Trade-off summary:
accuracy: 80.87 (best observed: 78.10, sacrifice: -2.77)
training_time: 73.29 (best observed: 77.30, sacrifice: -4.01)
Model quality:
accuracy: R² = 0.9745 (linear)
training_time: R² = 0.9277 (linear)
Top 3 observed runs by overall desirability:
1. Run #4 (D=0.9545): learning_rate=0.001, batch_size=32, dropout=0.1, hidden_layers=6, optimizer=adam
2. Run #3 (D=0.8029): learning_rate=0.1, batch_size=256, dropout=0.5, hidden_layers=6, optimizer=adam
3. Run #6 (D=0.7830): learning_rate=0.001, batch_size=32, dropout=0.5, hidden_layers=6, optimizer=sgd