Box-Behnken

Distributed Deep Learning

Scale ResNet-50 training across GPUs while maintaining efficiency.

Summary

This experiment investigates distributed deep-learning scaling: optimizing the multi-GPU training configuration for ResNet-50 on ImageNet.

The design varies 3 factors: gpu_count (GPUs), ranging from 8 to 64; batch_per_gpu (images), ranging from 32 to 256; and gradient_compression (%), ranging from 0 to 90. The goal is to maximize 2 responses: images_per_sec (img/s) and scaling_efficiency (%).

A Box-Behnken design was chosen because it efficiently fits quadratic models with 3 continuous factors while avoiding extreme corner combinations — requiring only 15 runs instead of the 27 needed for a full three-level factorial.
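The construction behind those 15 runs is simple enough to sketch in a few lines of standard-library Python. The `box_behnken` helper below is illustrative only (it is not part of the `doe` tool): it builds the design in coded units, where -1/0/+1 stand for a factor's low/center/high level.

```python
from itertools import combinations, product

def box_behnken(k, n_center=3):
    """Box-Behnken design in coded units: for every pair of factors,
    take the four (-1/+1, -1/+1) combinations with all other factors
    held at 0, then append n_center center-point runs."""
    runs = []
    for i, j in combinations(range(k), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0] * k
            run[i], run[j] = a, b
            runs.append(run)
    runs += [[0] * k for _ in range(n_center)]
    return runs

design = box_behnken(3)
print(len(design))                               # 15 runs
print(all(run.count(0) >= 1 for run in design))  # True: no extreme corners
```

With 3 factors this gives 3 pairs × 4 sign combinations = 12 edge-midpoint runs plus 3 center replicates, and no run ever sets all three factors to their extremes.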

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
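As a sketch of what fitting a quadratic response surface means mechanically, the snippet below builds the full quadratic design matrix (intercept, linear, interaction, and squared terms) and solves the least-squares normal equations. This is a standard-library illustration under simplifying assumptions, not the `doe` tool's implementation; on noiseless data generated from a known quadratic over the 15 coded Box-Behnken points, the fit recovers the true coefficients.

```python
from itertools import combinations

def quad_features(x):
    """Terms of a full quadratic model at point x = (x1, ..., xk):
    intercept, linear, pairwise interactions, pure quadratics."""
    row = [1.0] + list(x)
    row += [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    row += [xi * xi for xi in x]
    return row

def lstsq(X, y):
    """Least squares via the normal equations (X^T X) b = X^T y,
    solved with Gaussian elimination and partial pivoting."""
    n, p = len(X), len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         for i in range(p)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(p)]
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(A[r][i]))
        A[i], A[piv], b[i], b[piv] = A[piv], A[i], b[piv], b[i]
        for r in range(i + 1, p):
            f = A[r][i] / A[i][i]
            A[r] = [arc - f * aic for arc, aic in zip(A[r], A[i])]
            b[r] -= f * b[i]
    coef = [0.0] * p
    for i in reversed(range(p)):
        coef[i] = (b[i] - sum(A[i][c] * coef[c]
                              for c in range(i + 1, p))) / A[i][i]
    return coef

# The 15 coded Box-Behnken points for 3 factors (12 edges + 3 centers)
pts = ([(a, b, 0) for a in (-1, 1) for b in (-1, 1)]
       + [(a, 0, b) for a in (-1, 1) for b in (-1, 1)]
       + [(0, a, b) for a in (-1, 1) for b in (-1, 1)]
       + [(0, 0, 0)] * 3)
true_y = lambda x: 2 + 3 * x[0] - x[1] * x[2] + 0.5 * x[1] ** 2
coef = lstsq([quad_features(p) for p in pts], [true_y(p) for p in pts])
print([round(c, 6) for c in coef[:4]])  # intercept and linear terms
```

A 3-factor Box-Behnken design supports all 10 terms of the full quadratic model, which is exactly why it is the standard choice when curvature matters.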

Key Findings

For images_per_sec, the most influential factors by main-effect contribution were gpu_count (42.1%), gradient_compression (29.4%), and batch_per_gpu (28.6%). The best observed value was 27276.1 img/s (at gpu_count = 64, batch_per_gpu = 256, gradient_compression = 45).

For scaling_efficiency, the most influential factors by main-effect contribution were gradient_compression (48.7%), gpu_count (38.3%), and batch_per_gpu (13.0%). The best observed value was 99.0% (at gpu_count = 36, batch_per_gpu = 144, gradient_compression = 45).


Experimental Setup

Factors

Factor                 Levels        Type        Unit
-----------------------------------------------------
gpu_count              8, 36, 64     continuous  GPUs
batch_per_gpu          32, 144, 256  continuous  images
gradient_compression   0, 45, 90     continuous  %

Fixed: none

Responses

Response             Direction    Unit
--------------------------------------
images_per_sec       ↑ maximize   img/s
scaling_efficiency   ↑ maximize   %

Experimental Matrix

The Box-Behnken Design produces 15 runs. Each row is one experiment with specific factor settings.

Run   gpu_count   batch_per_gpu   gradient_compression
------------------------------------------------------
  1      36            32                  0
  2      36           144                 45
  3      64           144                 90
  4      64           144                  0
  5      36           144                 45
  6      36           144                 45
  7       8           144                 90
  8      64            32                 45
  9      36            32                 90
 10      64           256                 45
 11       8           144                  0
 12      36           256                 90
 13       8            32                 45
 14       8           256                 45
 15      36           256                  0
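The natural-unit settings in the matrix above follow from the coded design via each factor's center and half-range. A minimal sketch of that mapping (an illustrative helper, not part of the `doe` tool):

```python
# Map coded levels (-1, 0, +1) to the natural units used in the matrix.
# Centers and half-ranges follow from the factor bounds, e.g.
# gpu_count 8..64 -> center (8+64)/2 = 36, half-range (64-8)/2 = 28.
factors = {
    "gpu_count":            (36, 28),    # (center, half-range)
    "batch_per_gpu":        (144, 112),
    "gradient_compression": (45, 45),
}

def decode(coded):
    """coded: dict of factor name -> level in {-1, 0, +1}."""
    return {name: c + h * coded[name] for name, (c, h) in factors.items()}

print(decode({"gpu_count": 1, "batch_per_gpu": -1, "gradient_compression": 0}))
# {'gpu_count': 64, 'batch_per_gpu': 32, 'gradient_compression': 45} -- run 8
```

The all-zero coded point decodes to (36, 144, 45), the center point replicated in runs 2, 5, and 6.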

How to Run

terminal
$ doe info --config use_cases/15_distributed_training/config.json
$ doe generate --config use_cases/15_distributed_training/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/15_distributed_training/config.json
$ doe optimize --config use_cases/15_distributed_training/config.json
$ doe report --config use_cases/15_distributed_training/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: images_per_sec

Pareto Chart

Pareto chart for images_per_sec

Main Effects Plot

Main effects plot for images_per_sec

Response: scaling_efficiency

Pareto Chart

Pareto chart for scaling_efficiency

Main Effects Plot

Main effects plot for scaling_efficiency

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.


How to Read These Surfaces

Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.

  • Flat surface — these two factors have little effect on the response.
  • Tilted plane — strong linear effect; moving along one axis consistently changes the response.
  • Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
  • Saddle shape — significant interaction; the best setting of one factor depends on the other.
  • Red dots far from surface — poor model fit in that region; be cautious about predictions there.

images_per_sec (img/s) — R² = 0.831, Adj R² = 0.526
Moderate fit — surface shows general trends but some noise remains.
Curvature detected in batch_per_gpu, gradient_compression — look for a peak or valley in the surface.
Strongest linear driver: batch_per_gpu (increases images_per_sec).
Notable interaction: gpu_count × batch_per_gpu — the effect of one depends on the level of the other. Look for a twisted surface.

scaling_efficiency (%) — R² = 0.627, Adj R² = -0.045
Weak fit: the adjusted R² is negative, meaning the quadratic model explains little beyond noise here; treat this surface as a rough trend only.
Curvature detected in gradient_compression, gpu_count — look for a peak or valley in the surface.
Strongest linear driver: batch_per_gpu (decreases scaling_efficiency).
Notable interaction: gpu_count × batch_per_gpu — the effect of one depends on the level of the other. Look for a twisted surface.
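Assuming the usual adjusted-R² definition and a full quadratic model (9 predictors plus intercept) on the 15 runs, the reported values above can be reproduced to within rounding:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = 15 Box-Behnken runs, p = 9 quadratic-model predictors
print(round(adjusted_r2(0.831, 15, 9), 3))  # 0.527 (reported: 0.526)
print(round(adjusted_r2(0.627, 15, 9), 3))  # -0.044 (reported: -0.045)
```

With only 5 residual degrees of freedom, the penalty for model size is severe, which is why a respectable R² of 0.627 collapses to a negative adjusted R².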

images_per_sec: batch_per_gpu vs gradient_compression

RSM surface: images_per_sec — batch_per_gpu vs gradient_compression

images_per_sec: gpu_count vs batch_per_gpu

RSM surface: images_per_sec — gpu_count vs batch_per_gpu

images_per_sec: gpu_count vs gradient_compression

RSM surface: images_per_sec — gpu_count vs gradient_compression

scaling_efficiency: batch_per_gpu vs gradient_compression

RSM surface: scaling_efficiency — batch_per_gpu vs gradient_compression

scaling_efficiency: gpu_count vs batch_per_gpu

RSM surface: scaling_efficiency — gpu_count vs batch_per_gpu

scaling_efficiency: gpu_count vs gradient_compression

RSM surface: scaling_efficiency — gpu_count vs gradient_compression

Full Analysis Output

doe analyze
=== Main Effects: images_per_sec ===

Factor                     Effect    Std Error  % Contribution
--------------------------------------------------------------
gpu_count               9143.9500    1997.6298           42.1%
gradient_compression    6382.5000    1997.6298           29.4%
batch_per_gpu           6215.2000    1997.6298           28.6%

=== Summary Statistics: images_per_sec ===

gpu_count:
  Level   N        Mean         Std         Min         Max
  ------------------------------------------------------------
  36      7   9785.4286   7214.6845   1100.5000  20235.7000
  64      4   6920.5750   3516.3765   3806.9000  11499.3000
  8       4  16064.5250  10173.5266   2693.1000  27276.1000

batch_per_gpu:
  Level   N        Mean         Std         Min         Max
  ------------------------------------------------------------
  144     7  10633.8571   9817.8742   1100.5000  27276.1000
  256     4   7642.5750   7297.1113   2723.4000  18500.0000
  32      4  13857.7750   2745.5949  11499.3000  16643.6000

gradient_compression:
  Level   N        Mean         Std         Min         Max
  ------------------------------------------------------------
  0       4   6466.7500   6804.3168   2693.1000  16643.6000
  45      7  11882.0571   7060.7804   1100.5000  20235.7000
  90      4  12849.2500  10000.7841   4796.2000  27276.1000

=== Main Effects: scaling_efficiency ===

Factor                     Effect    Std Error  % Contribution
--------------------------------------------------------------
gradient_compression       6.7750       1.1818           48.7%
gpu_count                  5.3250       1.1818           38.3%
batch_per_gpu              1.8143       1.1818           13.0%

=== Summary Statistics: scaling_efficiency ===

gpu_count:
  Level   N     Mean      Std      Min      Max
  ------------------------------------------------------------
  36      7  94.8000   2.7080  91.3000  98.5000
  64      4  92.4250   4.8624  87.9000  99.0000
  8       4  89.4750   5.9337  83.4000  97.4000

batch_per_gpu:
  Level   N     Mean      Std      Min      Max
  ------------------------------------------------------------
  144     7  93.4143   4.9144  87.1000  99.0000
  256     4  91.6000   6.5243  83.4000  98.5000
  32      4  92.7250   2.0255  90.0000  94.9000

gradient_compression:
  Level   N     Mean      Std      Min      Max
  ------------------------------------------------------------
  0       4  97.4500   1.8267  94.9000  99.0000
  45      7  91.2429   4.4669  83.4000  98.2000
  90      4  90.6750   3.7456  87.1000  94.7000
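One plausible reading of the % Contribution column above is each factor's absolute main effect normalized by the sum of absolute effects; under that assumption the reported percentages for images_per_sec are reproduced exactly:

```python
# Main effects for images_per_sec from the analyze output above
effects = {"gpu_count": 9143.95, "gradient_compression": 6382.50,
           "batch_per_gpu": 6215.20}
total = sum(abs(e) for e in effects.values())
contrib = {f: round(100 * abs(e) / total, 1) for f, e in effects.items()}
print(contrib)
# {'gpu_count': 42.1, 'gradient_compression': 29.4, 'batch_per_gpu': 28.6}
```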

Optimization Recommendations

doe optimize
=== Optimization: images_per_sec ===
Direction: maximize

Best observed run: #10
  gpu_count            = 64
  batch_per_gpu        = 256
  gradient_compression = 45
  Value: 27276.1

RSM Model (linear, R² = 0.05):
  Coefficients:
    intercept:            +10695.8933
    gpu_count:            +1634.1250
    batch_per_gpu:        +1714.1500
    gradient_compression: -369.2250

Predicted optimum:
  gpu_count            = 64
  batch_per_gpu        = 256
  gradient_compression = 45
  Predicted value: 14044.1683

Factor importance:
  1. batch_per_gpu        (effect: 6246.3, contribution: 44.0%)
  2. gradient_compression (effect: 4686.9, contribution: 33.0%)
  3. gpu_count            (effect: 3268.3, contribution: 23.0%)

=== Optimization: scaling_efficiency ===
Direction: maximize

Best observed run: #14
  gpu_count            = 36
  batch_per_gpu        = 144
  gradient_compression = 45
  Value: 99.0

RSM Model (linear, R² = 0.02):
  Coefficients:
    intercept:            +92.7467
    gpu_count:            +0.8125
    batch_per_gpu:        -0.2125
    gradient_compression: +0.1000

Predicted optimum:
  gpu_count            = 64
  batch_per_gpu        = 32
  gradient_compression = 45
  Predicted value: 93.7717

Factor importance:
  1. batch_per_gpu        (effect: 5.1, contribution: 66.5%)
  2. gpu_count            (effect: 1.6, contribution: 21.2%)
  3. gradient_compression (effect: 1.0, contribution: 12.4%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
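The combination step can be sketched directly from the numbers reported below; this standard-library snippet (an illustration of the Derringer–Suich scheme, not the `doe` tool's code) reproduces the overall D from the two per-response desirabilities and their weights:

```python
from math import prod

def desirability_max(y, lo, hi):
    """Larger-is-better desirability: 0 at/below lo, 1 at/above hi,
    linear in between."""
    return min(1.0, max(0.0, (y - lo) / (hi - lo)))

def overall(ds, weights):
    """Weighted geometric mean of per-response desirabilities."""
    return prod(d ** w for d, w in zip(ds, weights)) ** (1 / sum(weights))

# Per-response desirabilities and weights from the tables below:
D = overall([0.5853, 0.7156], [1.5, 1.0])
print(round(D, 4))  # 0.6343
```

Because D is a geometric mean, any single response with desirability 0 drives the overall score to 0, which is what makes the method a genuine compromise-finder rather than a simple average.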

Overall Desirability
D = 0.6343

Per-Response Desirability

Response             Weight   Desirability   Predicted        Dir
-----------------------------------------------------------------
images_per_sec       1.5      0.5853         16643.60 img/s   ↑
scaling_efficiency   1.0      0.7156         94.90 %          ↑

Recommended Settings

Factor                 Value
----------------------------
gpu_count              8 GPUs
batch_per_gpu          144 images
gradient_compression   90 %

Source: from observed run #12

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response             Predicted   Best Observed   Sacrifice
----------------------------------------------------------
images_per_sec       16643.60    27276.10        +10632.50
scaling_efficiency   94.90       99.00           +4.10
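The sacrifice figures are simply the gap between each single-objective best and the compromise prediction:

```python
# Sacrifice = single-objective best minus the predicted compromise value
best = {"images_per_sec": 27276.10, "scaling_efficiency": 99.00}
compromise = {"images_per_sec": 16643.60, "scaling_efficiency": 94.90}
sacrifice = {r: round(best[r] - compromise[r], 2) for r in best}
print(sacrifice)  # {'images_per_sec': 10632.5, 'scaling_efficiency': 4.1}
```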

Top 3 Runs by Desirability

Run    D        Factor Settings
----------------------------------------------------------------------
#12    0.6343   gpu_count=8, batch_per_gpu=144, gradient_compression=90
#3     0.6200   gpu_count=36, batch_per_gpu=144, gradient_compression=45
#10    0.5683   gpu_count=64, batch_per_gpu=144, gradient_compression=90

Model Quality

Response             R²       Type
----------------------------------
images_per_sec       0.1313   linear
scaling_efficiency   0.1230   linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.6343

Response              Weight   Desirability   Predicted        Direction
---------------------------------------------------------------------
images_per_sec        1.5      0.5853         16643.60 img/s   ↑
scaling_efficiency    1.0      0.7156         94.90 %          ↑

Recommended settings:
  gpu_count            = 8 GPUs
  batch_per_gpu        = 144 images
  gradient_compression = 90 %
  (from observed run #12)

Trade-off summary:
  images_per_sec:     16643.60 (best observed: 27276.10, sacrifice: +10632.50)
  scaling_efficiency: 94.90 (best observed: 99.00, sacrifice: +4.10)

Model quality:
  images_per_sec:     R² = 0.1313 (linear)
  scaling_efficiency: R² = 0.1230 (linear)

Top 3 observed runs by overall desirability:
  1. Run #12 (D=0.6343): gpu_count=8, batch_per_gpu=144, gradient_compression=90
  2. Run #3  (D=0.6200): gpu_count=36, batch_per_gpu=144, gradient_compression=45
  3. Run #10 (D=0.5683): gpu_count=64, batch_per_gpu=144, gradient_compression=90