Box-Behnken

InfiniBand Network Tuning

Optimize InfiniBand HDR message rate and tail latency through MTU, queue depth, and RDMA settings.

Summary

This experiment tunes an InfiniBand HDR network. It uses a Box-Behnken design to optimize throughput (message rate) and tail latency.

The design varies 3 factors: mtu (bytes), ranging from 2048 to 4096, queue depth, ranging from 64 to 512, and rdma cm, ranging from 0 to 1. The goal is to optimize 2 responses: msg rate (Mmsg/s) (maximize) and p99 lat (us) (minimize). Fixed conditions held constant across all runs include ib speed = HDR, ports = 1.

A Box-Behnken design was chosen because it efficiently fits quadratic models with 3 continuous factors while avoiding extreme corner combinations. It requires only 15 runs (12 edge midpoints plus 3 center points) instead of the 27 needed for a full factorial at three levels.
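As an illustration of how such a design is built, the following is a minimal sketch that generates a three-factor Box-Behnken design in coded units and maps it onto the factor ranges given above. The helper names `box_behnken` and `decode` are hypothetical and are not part of the `doe` tool; the number of center points (3) is taken from the experimental matrix in this report.

```python
from itertools import combinations, product

def box_behnken(k, center_runs=3):
    """Return a Box-Behnken design in coded units (-1, 0, +1)."""
    runs = []
    # For every pair of factors, run a 2x2 factorial on that pair
    # while all remaining factors stay at their center level (0).
    for i, j in combinations(range(k), 2):
        for a, b in product((-1, 1), repeat=2):
            point = [0] * k
            point[i], point[j] = a, b
            runs.append(tuple(point))
    # Replicated center points allow an estimate of pure error.
    runs.extend([(0,) * k] * center_runs)
    return runs

def decode(coded, low, high):
    """Map a coded level in [-1, +1] onto the real factor range."""
    center = (low + high) / 2
    half_range = (high - low) / 2
    return center + coded * half_range

# Factor ranges taken from the setup described in this report.
ranges = {"mtu": (2048, 4096), "queue_depth": (64, 512), "rdma_cm": (0, 1)}

design = box_behnken(len(ranges))
table = [
    {name: decode(x, *ranges[name]) for name, x in zip(ranges, run)}
    for run in design
]

print(len(design), "Box-Behnken runs vs", 3 ** len(ranges), "for a 3-level full factorial")
for row in table:
    print(row)
```

Every generated run keeps at least one factor at its center level, which is why no corner of the factor cube (all three factors at an extreme) appears in the design. The run order produced here is systematic; a real experiment would normally randomize it.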

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
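For reference, a full quadratic model in three coded factors x_1 (mtu), x_2 (queue depth), and x_3 (rdma cm) has the general form below. This is the standard second-order response surface form, stated here as an assumption about what was fitted; the exact parameterization used by the tool is not shown in this report.

```latex
y = \beta_0
  + \sum_{i=1}^{3} \beta_i x_i
  + \sum_{i=1}^{3} \beta_{ii} x_i^2
  + \sum_{1 \le i < j \le 3} \beta_{ij} x_i x_j
  + \varepsilon
```

That is 1 + 3 + 3 + 3 = 10 coefficients estimated from 15 runs, which leaves 5 residual degrees of freedom.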

Key Findings

For msg rate, the most influential factors were mtu (38.2%), queue depth (34.7%), and rdma cm (27.1%). The best observed value was 161.8 Mmsg/s (at mtu = 3072, queue depth = 288, rdma cm = 0.5).

For p99 lat, the most influential factors were rdma cm (34.8%), mtu (34.5%), and queue depth (30.6%). The best observed value was 1.854 us (at mtu = 4096, queue depth = 288, rdma cm = 0).

Experimental Setup

Factors

Factor       Levels      Type        Unit
mtu          2048, 4096  continuous  bytes
queue_depth  64, 512     continuous
rdma_cm      0, 1        continuous

Fixed: ib_speed=HDR, ports=1

Responses

Response  Direction   Unit
msg_rate  ↑ maximize  Mmsg/s
p99_lat   ↓ minimize  us

Experimental Matrix

The Box-Behnken design produces 15 runs. Each row is one experiment with specific factor settings.

Run  mtu   queue_depth  rdma_cm
1    3072  64           0
2    3072  288          0.5
3    4096  288          1
4    4096  288          0
5    3072  288          0.5
6    3072  288          0.5
7    2048  288          1
8    4096  64           0.5
9    3072  64           1
10   4096  512          0.5
11   2048  288          0
12   3072  512          1
13   2048  64           0.5
14   2048  512          0.5
15   3072  512          0

How to Run

terminal
$ doe info --config use_cases/11_infiniband_network/config.json
$ doe generate --config use_cases/11_infiniband_network/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/11_infiniband_network/config.json
$ doe optimize --config use_cases/11_infiniband_network/config.json
$ doe optimize --config use_cases/11_infiniband_network/config.json --multi  # multi-objective
$ doe report --config use_cases/11_infiniband_network/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: msg_rate

Pareto Chart

Pareto chart for msg_rate

Main Effects Plot

Main effects plot for msg_rate

Response: p99_lat

Pareto Chart

Pareto chart for p99_lat

Main Effects Plot

Main effects plot for p99_lat

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

How to Read These Surfaces

Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.

  • Flat surface — these two factors have little effect on the response.
  • Tilted plane — strong linear effect; moving along one axis consistently changes the response.
  • Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
  • Saddle shape — significant interaction; the best setting of one factor depends on the other.
  • Red dots far from surface — poor model fit in that region; be cautious about predictions there.

msg_rate (Mmsg/s) — R² = 0.864, Adj R² = 0.618
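The model fits the observed runs well, but the gap between R² and adjusted R² suggests that part of the fit comes from the number of model terms rather than real structure. Treat the surface shape as indicative rather than definitive.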
Curvature detected in mtu, rdma_cm — look for a peak or valley in the surface.
Strongest linear driver: rdma_cm (increases msg_rate).
Notable interaction: mtu × rdma_cm — the effect of one depends on the level of the other. Look for a twisted surface.

p99_lat (us) — R² = 0.622, Adj R² = -0.058
Weak fit: the negative adjusted R² means the quadratic model explains p99_lat no better than the overall mean would, so the surface shows rough trends at most.
Strongest linear driver: queue_depth (increases p99_lat).
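The relationship between R² and adjusted R² quoted above can be checked with a few lines. This is a minimal sketch that assumes the standard adjusted R² formula and a full quadratic model with 9 predictor terms (3 linear, 3 squared, 3 interaction) fitted to 15 runs; the function name `adjusted_r2` is illustrative, not part of the `doe` tool.

```python
def adjusted_r2(r2, n_runs, n_terms):
    """Adjusted R^2 for a model with n_terms predictors plus an intercept."""
    return 1 - (1 - r2) * (n_runs - 1) / (n_runs - n_terms - 1)

# Assumption: 15 runs and a full quadratic model in 3 factors,
# i.e. 3 linear + 3 squared + 3 interaction terms = 9 predictors.
N_RUNS, N_TERMS = 15, 9

for response, r2 in {"msg_rate": 0.864, "p99_lat": 0.622}.items():
    adj = adjusted_r2(r2, N_RUNS, N_TERMS)
    print(f"{response}: R2 = {r2:.3f}, adjusted R2 = {adj:.3f}")
```

Under these assumptions the results agree with the reported adjusted values to within rounding of the R² inputs. With only 5 residual degrees of freedom, the penalty factor (n - 1) / (n - p - 1) is 2.8, so any R² below roughly 0.64 yields a negative adjusted R².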

[Figure: RSM surface for msg_rate, mtu vs queue_depth]

[Figure: RSM surface for msg_rate, mtu vs rdma_cm]

[Figure: RSM surface for msg_rate, queue_depth vs rdma_cm]

[Figure: RSM surface for p99_lat, mtu vs queue_depth]

[Figure: RSM surface for p99_lat, mtu vs rdma_cm]

[Figure: RSM surface for p99_lat, queue_depth vs rdma_cm]

Full Analysis Output

doe analyze
=== Main Effects: msg_rate ===
Factor          Effect      Std Error   % Contribution
--------------------------------------------------------------
mtu             26.0250     4.9434      45.6%
rdma_cm         20.0250     4.9434      35.1%
queue_depth     11.0250     4.9434      19.3%

=== Summary Statistics: msg_rate ===
mtu:
Level   N   Mean        Std        Min         Max
------------------------------------------------------------
2048    4   119.3250    18.6784    98.6000     138.9000
3072    7   124.5286    17.9595    102.5000    146.2000
4096    4   145.3500    13.7551    128.3000    161.8000

queue_depth:
Level   N   Mean        Std        Min         Max
------------------------------------------------------------
288     7   123.8000    18.3463    98.6000     147.4000
512     4   131.1250    19.5172    103.7000    146.2000
64      4   134.8250    23.2940    109.1000    161.8000

rdma_cm:
Level   N   Mean        Std        Min         Max
------------------------------------------------------------
0       4   118.9250    21.7403    98.6000     145.1000
0.5     7   128.4143    20.5829    102.5000    161.8000
1       4   138.9500    11.0889    123.3000    147.4000

=== Main Effects: p99_lat ===
Factor          Effect      Std Error   % Contribution
--------------------------------------------------------------
rdma_cm         0.3216      0.0503      57.7%
queue_depth     0.1816      0.0503      32.6%
mtu             0.0542      0.0503      9.7%

=== Summary Statistics: p99_lat ===
mtu:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
2048    4   2.1778    0.3372    1.8540    2.6320
3072    7   2.1336    0.1425    1.9730    2.3140
4096    4   2.1877    0.1403    2.0240    2.3670

queue_depth:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
288     7   2.2401    0.2271    1.9730    2.6320
512     4   2.1205    0.1382    2.0090    2.3080
64      4   2.0585    0.1548    1.8540    2.1790

rdma_cm:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
0       4   2.1405    0.0841    2.0230    2.2160
0.5     7   2.0499    0.1503    1.8540    2.3140
1       4   2.3715    0.1906    2.1790    2.6320
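The Effect and % Contribution columns for msg_rate can be reproduced from the level means in the summary statistics above. The sketch below assumes that the effect is the spread between the highest and lowest level mean and that the contribution is each effect's share of the summed effects. This matches the printed numbers, but it is an inference from the output, not a documented definition of the tool.

```python
# Level means copied from the msg_rate summary statistics (Mmsg/s).
level_means = {
    "mtu": {2048: 119.3250, 3072: 124.5286, 4096: 145.3500},
    "queue_depth": {64: 134.8250, 288: 123.8000, 512: 131.1250},
    "rdma_cm": {0: 118.9250, 0.5: 128.4143, 1: 138.9500},
}

# Assumption: effect = highest level mean minus lowest level mean.
effects = {f: max(m.values()) - min(m.values()) for f, m in level_means.items()}

# Assumption: contribution = effect as a percentage of the summed effects.
total = sum(effects.values())
contribution = {f: 100 * e / total for f, e in effects.items()}

for factor in sorted(effects, key=effects.get, reverse=True):
    print(f"{factor:<12} effect = {effects[factor]:7.4f}  "
          f"contribution = {contribution[factor]:4.1f}%")
```

Note that for queue_depth the spread is between the 64 and 288 levels, not between the two extremes, which is one sign that its influence on msg_rate is not a simple linear trend.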

Optimization Recommendations

doe optimize
=== Optimization: msg_rate ===
Direction: maximize
Best observed run: #10
  mtu = 4096
  queue_depth = 512
  rdma_cm = 0.5
  Value: 161.8

RSM Model (linear, R² = 0.34):
Coefficients:
  intercept: +128.6933
  mtu: +13.6875
  queue_depth: +5.3625
  rdma_cm: +1.4250
Predicted optimum:
  mtu = 4096
  queue_depth = 512
  rdma_cm = 0.5
  Predicted value: 147.7433

Factor importance:
  1. mtu (effect: 27.4, contribution: 57.0%)
  2. queue_depth (effect: 10.7, contribution: 22.3%)
  3. rdma_cm (effect: 9.9, contribution: 20.7%)

=== Optimization: p99_lat ===
Direction: minimize
Best observed run: #8
  mtu = 2048
  queue_depth = 288
  rdma_cm = 1
  Value: 1.854

RSM Model (linear, R² = 0.11):
Coefficients:
  intercept: +2.1598
  mtu: +0.0219
  queue_depth: +0.0341
  rdma_cm: -0.0760
Predicted optimum:
  mtu = 3072
  queue_depth = 512
  rdma_cm = 0
  Predicted value: 2.2699

Factor importance:
  1. rdma_cm (effect: 0.2, contribution: 36.2%)
  2. mtu (effect: 0.1, contribution: 33.2%)
  3. queue_depth (effect: 0.1, contribution: 30.6%)
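The predicted value for msg_rate can be reproduced from the printed coefficients if the factors are expressed in coded units, where -1 is the low level, 0 the center, and +1 the high level. That coding is an assumption inferred from the numbers; the helper names `to_coded` and `predict` are hypothetical.

```python
def to_coded(value, low, high):
    """Scale a real factor setting to coded units in [-1, +1]."""
    return (value - (low + high) / 2) / ((high - low) / 2)

# Factor ranges from the experimental setup.
RANGES = {"mtu": (2048, 4096), "queue_depth": (64, 512), "rdma_cm": (0, 1)}

# Coefficients of the linear msg_rate model as printed by `doe optimize`.
INTERCEPT = 128.6933
COEFFS = {"mtu": 13.6875, "queue_depth": 5.3625, "rdma_cm": 1.4250}

def predict(settings):
    """Evaluate intercept + sum(coef * coded level); assumes coded factors."""
    return INTERCEPT + sum(
        COEFFS[name] * to_coded(settings[name], *RANGES[name]) for name in COEFFS
    )

reported_optimum = {"mtu": 4096, "queue_depth": 512, "rdma_cm": 0.5}
print(f"predicted msg_rate: {predict(reported_optimum):.4f} Mmsg/s")
```

With rdma_cm at its center level the rdma_cm term drops out, so the prediction is the intercept plus the mtu and queue_depth coefficients. Given the low R² of the linear model (0.34), the predicted value is best read as a rough trend estimate.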

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.7799

Per-Response Desirability

Response  Weight  Desirability  Predicted      Dir
msg_rate  1.5     0.9545        161.80 Mmsg/s  ↑
p99_lat   1.0     0.5760        2.18 us        ↓
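The overall desirability can be recomputed from the per-response values and weights listed above. This is a minimal sketch of the weighted geometric mean, assuming the weights act as exponents normalized by their sum; the function name `overall_desirability` is illustrative and not taken from the `doe` tool.

```python
import math

def overall_desirability(desirabilities, weights):
    """Weighted geometric mean of per-response desirabilities (each in 0..1)."""
    # A response with zero desirability makes the whole setting unacceptable.
    if any(d <= 0 for d in desirabilities.values()):
        return 0.0
    total_weight = sum(weights.values())
    # Work in log space: D = exp(sum(w_i * ln d_i) / sum(w_i)).
    log_sum = sum(weights[r] * math.log(d) for r, d in desirabilities.items())
    return math.exp(log_sum / total_weight)

# Per-response desirabilities and weights from the table above.
d = {"msg_rate": 0.9545, "p99_lat": 0.5760}
w = {"msg_rate": 1.5, "p99_lat": 1.0}

print(f"D = {overall_desirability(d, w):.4f}")
```

Under this assumption the result agrees with the reported D = 0.7799 to within rounding of the inputs. Because the mean is geometric, the weaker p99_lat desirability pulls D down more than an arithmetic average would, even though msg_rate carries the larger weight.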

Recommended Settings

Factor       Value
mtu          3072 bytes
queue_depth  288
rdma_cm      0.5

Source: from observed run #10

Trade-off Summary

Sacrifice is how much worse the predicted value is than the best value observed for that response alone.

Response  Predicted  Best Observed  Sacrifice
msg_rate  161.80     161.80         +0.00
p99_lat   2.18       1.85           +0.32

Top 3 Runs by Desirability

Run  D       Factor Settings
#10  0.7799  mtu=3072, queue_depth=288, rdma_cm=0.5
#4   0.7311  mtu=2048, queue_depth=288, rdma_cm=1
#3   0.7200  mtu=3072, queue_depth=512, rdma_cm=1

Model Quality

Response  R²      Type
msg_rate  0.3266  linear
p99_lat   0.8835  quadratic

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.7799

Response    Weight   Desirability   Predicted       Direction
---------------------------------------------------------------------
msg_rate    1.5      0.9545         161.80 Mmsg/s   ↑
p99_lat     1.0      0.5760         2.18 us         ↓

Recommended settings:
  mtu = 3072 bytes
  queue_depth = 288
  rdma_cm = 0.5
  (from observed run #10)

Trade-off summary:
  msg_rate: 161.80 (best observed: 161.80, sacrifice: +0.00)
  p99_lat: 2.18 (best observed: 1.85, sacrifice: +0.32)

Model quality:
  msg_rate: R² = 0.3266 (linear)
  p99_lat: R² = 0.8835 (quadratic)

Top 3 observed runs by overall desirability:
  1. Run #10 (D=0.7799): mtu=3072, queue_depth=288, rdma_cm=0.5
  2. Run #4 (D=0.7311): mtu=2048, queue_depth=288, rdma_cm=1
  3. Run #3 (D=0.7200): mtu=3072, queue_depth=512, rdma_cm=1