InfiniBand Network Tuning

Summary

This experiment investigates InfiniBand network tuning, using a Box-Behnken design to optimize InfiniBand HDR network throughput and tail latency.

The design varies 3 factors: mtu (bytes), ranging from 2048 to 4096; queue depth, ranging from 64 to 512; and rdma cm, ranging from 0 to 1. The goal is to optimize 2 responses: msg rate (Mmsg/s, maximize) and p99 lat (us, minimize). Two conditions are held constant across all runs: ib speed = HDR and ports = 1.

A Box-Behnken design was chosen because it efficiently fits quadratic models with 3 continuous factors while avoiding extreme corner combinations, requiring only 15 runs instead of the 27 needed for a full factorial at three levels.
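The run count can be checked with a minimal sketch of the standard three-factor Box-Behnken construction. The function names `box_behnken_3` and `decode` are illustrative and not part of the `doe` tool, and the choice of 3 center points is taken from the experimental matrix in this report.

```python
from itertools import combinations, product

def box_behnken_3(n_center=3):
    """Three-factor Box-Behnken runs in coded units (-1, 0, +1)."""
    n_factors = 3
    runs = []
    # For each pair of factors, run a 2x2 factorial on that pair
    # while the remaining factor stays at its center level (0).
    for i, j in combinations(range(n_factors), 2):
        for a, b in product((-1, 1), repeat=2):
            point = [0] * n_factors
            point[i], point[j] = a, b
            runs.append(tuple(point))
    # Replicated center points give an estimate of pure error.
    runs.extend([(0,) * n_factors] * n_center)
    return runs

def decode(coded, low, high):
    """Map a coded level (-1, 0, +1) to its physical value."""
    return (low + high) / 2 + coded * (high - low) / 2

# Factor ranges as listed in this report.
ranges = {"mtu": (2048, 4096), "queue_depth": (64, 512), "rdma_cm": (0, 1)}

design = box_behnken_3()
physical = [
    tuple(decode(c, *ranges[name]) for c, name in zip(run, ranges))
    for run in design
]

print(len(design))        # 12 edge midpoints + 3 center points
print(3 ** len(ranges))   # size of a three-level full factorial
```

Every non-center run keeps one factor at its midpoint, which is why no run sits at a corner of the factor space.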

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
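For reference, the usual second-order response surface model for three coded factors has the form below. That the tool fits exactly this full form (all squared and two-factor interaction terms) is an assumption, not something the report states.

```latex
\hat{y} = \beta_0
        + \sum_{i=1}^{3} \beta_i x_i
        + \sum_{i=1}^{3} \beta_{ii} x_i^{2}
        + \sum_{1 \le i < j \le 3} \beta_{ij} x_i x_j
```

Here each $x_i$ is a factor in coded units (-1 at the low level, +1 at the high level). The model has 1 + 3 + 3 + 3 = 10 coefficients, so a 15-run design leaves 5 degrees of freedom for the residual.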

Key Findings

For msg rate, the most influential factors were mtu (38.2%), queue depth (34.7%), and rdma cm (27.1%). The best observed value was 161.8 Mmsg/s (at mtu = 3072, queue depth = 288, rdma cm = 0.5).

For p99 lat, the most influential factors were rdma cm (34.8%), mtu (34.5%), and queue depth (30.6%). The best observed value was 1.854 us (at mtu = 4096, queue depth = 288, rdma cm = 0).

Recommended Next Steps

Run confirmation experiments at the predicted optimal settings to validate the model.
Consider whether any fixed factors should be varied in a future study.

Experimental Setup

Factors

Factor	Levels	Type	Unit
`mtu`	2048, 4096	continuous	bytes
`queue_depth`	64, 512	continuous
`rdma_cm`	0, 1	continuous

Fixed: ib_speed=HDR, ports=1

Responses

Response	Direction	Unit
`msg_rate`	↑ maximize	Mmsg/s
`p99_lat`	↓ minimize	us

Experimental Matrix

The Box-Behnken design produces 15 runs. Each row is one experiment with specific factor settings.

Run	`mtu`	`queue_depth`	`rdma_cm`
1	3072	64	0
2	3072	288	0.5
3	4096	288	1
4	4096	288	0
5	3072	288	0.5
6	3072	288	0.5
7	2048	288	1
8	4096	64	0.5
9	3072	64	1
10	4096	512	0.5
11	2048	288	0
12	3072	512	1
13	2048	64	0.5
14	2048	512	0.5
15	3072	512	0

How to Run

terminal
$ doe info --config use_cases/11_infiniband_network/config.json
$ doe generate --config use_cases/11_infiniband_network/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/11_infiniband_network/config.json
$ doe optimize --config use_cases/11_infiniband_network/config.json
$ doe optimize --config use_cases/11_infiniband_network/config.json --multi  # multi-objective
$ doe report --config use_cases/11_infiniband_network/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: msg_rate

Pareto Chart

Main Effects Plot

Response: p99_lat

Pareto Chart

Main Effects Plot

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.


How to Read These Surfaces

Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.

Flat surface — these two factors have little effect on the response.
Tilted plane — strong linear effect; moving along one axis consistently changes the response.
Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
Saddle shape — significant interaction; the best setting of one factor depends on the other.
Red dots far from surface — poor model fit in that region; be cautious about predictions there.

msg_rate (Mmsg/s) — R² = 0.864, Adj R² = 0.618
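R² is high, but the much lower adjusted R² suggests that some model terms are fitting noise; treat the surface shape as indicative rather than firmly established.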
Curvature detected in mtu, rdma_cm — look for a peak or valley in the surface.
Strongest linear driver: rdma_cm (increases msg_rate).
Notable interaction: mtu × rdma_cm — the effect of one depends on the level of the other. Look for a twisted surface.

p99_lat (us) — R² = 0.622, Adj R² = -0.058
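Poor fit: the negative adjusted R² means the model explains no more variation than the mean response once its number of terms is accounted for, so this surface should not be relied on for prediction.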
Strongest linear driver: queue_depth (increases p99_lat).
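The gap between R² and adjusted R² follows from the run count. A minimal sketch, assuming the quadratic model has 9 terms besides the intercept (3 linear, 3 squared, 3 interaction) fitted to the 15 runs:

```python
def adjusted_r2(r2, n_runs, n_terms):
    """Adjusted R^2 for a model with n_terms predictors plus an intercept."""
    return 1 - (1 - r2) * (n_runs - 1) / (n_runs - n_terms - 1)

N_RUNS = 15   # Box-Behnken runs in this study
N_TERMS = 9   # assumed: 3 linear + 3 squared + 3 interaction terms

# R^2 values as reported for the two quadratic fits.
for name, r2 in {"msg_rate": 0.864, "p99_lat": 0.622}.items():
    adj = adjusted_r2(r2, N_RUNS, N_TERMS)
    print(f"{name}: R2 = {r2:.3f}, adjusted R2 = {adj:.3f}")
```

Under these assumptions the results land within rounding of the reported 0.618 and -0.058. With only 5 residual degrees of freedom, the penalty factor (n - 1) / (n - p - 1) is 2.8, so any R² below about 0.64 yields a negative adjusted value.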

msg rate: mtu vs queue depth

msg rate: mtu vs rdma cm

msg rate: queue depth vs rdma cm

p99 lat: mtu vs queue depth

p99 lat: mtu vs rdma cm

p99 lat: queue depth vs rdma cm

Full Analysis Output

doe analyze
=== Main Effects: msg_rate ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
mtu                     26.0250       4.9434            45.6%
rdma_cm                 20.0250       4.9434            35.1%
queue_depth             11.0250       4.9434            19.3%

=== Summary Statistics: msg_rate ===

mtu:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  2048                4   119.3250    18.6784    98.6000   138.9000
  3072                7   124.5286    17.9595   102.5000   146.2000
  4096                4   145.3500    13.7551   128.3000   161.8000

queue_depth:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  288                 7   123.8000    18.3463    98.6000   147.4000
  512                 4   131.1250    19.5172   103.7000   146.2000
  64                  4   134.8250    23.2940   109.1000   161.8000

rdma_cm:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  0                   4   118.9250    21.7403    98.6000   145.1000
  0.5                 7   128.4143    20.5829   102.5000   161.8000
  1                   4   138.9500    11.0889   123.3000   147.4000

=== Main Effects: p99_lat ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
rdma_cm                  0.3216       0.0503            57.7%
queue_depth              0.1816       0.0503            32.6%
mtu                      0.0542       0.0503             9.7%

=== Summary Statistics: p99_lat ===

mtu:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  2048                4     2.1778     0.3372     1.8540     2.6320
  3072                7     2.1336     0.1425     1.9730     2.3140
  4096                4     2.1877     0.1403     2.0240     2.3670

queue_depth:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  288                 7     2.2401     0.2271     1.9730     2.6320
  512                 4     2.1205     0.1382     2.0090     2.3080
  64                  4     2.0585     0.1548     1.8540     2.1790

rdma_cm:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  0                   4     2.1405     0.0841     2.0230     2.2160
  0.5                 7     2.0499     0.1503     1.8540     2.3140
  1                   4     2.3715     0.1906     2.1790     2.6320
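The Effect and % Contribution columns for msg_rate can be reproduced from the level means above. The definitions used here (effect as the range of level means, contribution as each effect's share of the summed effects) are inferred from the numbers and are not confirmed against the tool's source.

```python
# Level means copied from the "Summary Statistics: msg_rate" tables.
level_means = {
    "mtu":         {2048: 119.3250, 3072: 124.5286, 4096: 145.3500},
    "queue_depth": {64: 134.8250, 288: 123.8000, 512: 131.1250},
    "rdma_cm":     {0: 118.9250, 0.5: 128.4143, 1: 138.9500},
}

# Assumed definition: effect = largest level mean minus smallest level mean.
effects = {f: max(m.values()) - min(m.values()) for f, m in level_means.items()}

# Assumed definition: contribution = effect as a share of all effects.
total = sum(effects.values())
contribution = {f: 100 * e / total for f, e in effects.items()}

for factor in sorted(effects, key=effects.get, reverse=True):
    print(f"{factor:<12} effect = {effects[factor]:8.4f}  "
          f"contribution = {contribution[factor]:5.1f}%")
```

Because the effect is a range over three level means, it does not separate linear trend from curvature. For queue_depth the largest difference is between the center level (288) and the low level (64), not between the two extremes.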

Optimization Recommendations

doe optimize
=== Optimization: msg_rate ===
Direction: maximize

Best observed run: #10
  mtu = 4096
  queue_depth = 512
  rdma_cm = 0.5
  Value: 161.8

RSM Model (linear, R² = 0.34):
  Coefficients:
    intercept:  +128.6933
    mtu:  +13.6875
    queue_depth:  +5.3625
    rdma_cm:  +1.4250
  Predicted optimum:
    mtu = 4096
    queue_depth = 512
    rdma_cm = 0.5
    Predicted value: 147.7433

Factor importance:
  1. mtu  (effect: 27.4, contribution: 57.0%)
  2. queue_depth  (effect: 10.7, contribution: 22.3%)
  3. rdma_cm  (effect: 9.9, contribution: 20.7%)

=== Optimization: p99_lat ===
Direction: minimize

Best observed run: #8
  mtu = 2048
  queue_depth = 288
  rdma_cm = 1
  Value: 1.854

RSM Model (linear, R² = 0.11):
  Coefficients:
    intercept:  +2.1598
    mtu:  +0.0219
    queue_depth:  +0.0341
    rdma_cm:  -0.0760
  Predicted optimum:
    mtu = 3072
    queue_depth = 512
    rdma_cm = 0
    Predicted value: 2.2699

Factor importance:
  1. rdma_cm  (effect: 0.2, contribution: 36.2%)
  2. mtu  (effect: 0.1, contribution: 33.2%)
  3. queue_depth  (effect: 0.1, contribution: 30.6%)
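The predicted values in the output above are consistent with linear models expressed in coded units, where each factor is rescaled to -1 at its low level and +1 at its high level. A minimal sketch under that assumption; `to_coded` and `predict` are illustrative helpers, not part of the `doe` tool:

```python
# Factor ranges as listed in this report.
RANGES = {"mtu": (2048, 4096), "queue_depth": (64, 512), "rdma_cm": (0, 1)}

def to_coded(name, value):
    """Rescale a physical setting to coded units (-1 low, 0 center, +1 high)."""
    low, high = RANGES[name]
    return (value - (low + high) / 2) / ((high - low) / 2)

def predict(intercept, coefs, settings):
    """Evaluate a first-order model at the given physical settings."""
    return intercept + sum(
        coefs[name] * to_coded(name, value) for name, value in settings.items()
    )

# Coefficients and "predicted optimum" settings copied from the output above.
msg_rate = predict(
    128.6933,
    {"mtu": 13.6875, "queue_depth": 5.3625, "rdma_cm": 1.4250},
    {"mtu": 4096, "queue_depth": 512, "rdma_cm": 0.5},
)
p99_lat = predict(
    2.1598,
    {"mtu": 0.0219, "queue_depth": 0.0341, "rdma_cm": -0.0760},
    {"mtu": 3072, "queue_depth": 512, "rdma_cm": 0},
)
print(round(msg_rate, 4), round(p99_lat, 4))
```

This reproduces the reported predicted values of 147.7433 and 2.2699. The p99_lat point evaluates above the model intercept (2.1598) even though the goal is to minimize, so it may be worth confirming how the tool selects that point. With R² = 0.11, the linear p99_lat model carries little predictive weight either way.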

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability

D = 0.7799

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`msg_rate`	1.5	0.9545	161.80 Mmsg/s	↑
`p99_lat`	1.0	0.5760	2.18 us	↓
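The overall desirability can be checked from the per-response values and weights in this section. This is a minimal sketch of a weighted geometric mean; the function name is illustrative, and the individual desirability curves are not reconstructed because their bounds are not given in the report.

```python
import math

def overall_desirability(d, w):
    """Weighted geometric mean of per-response desirabilities."""
    # Any fully undesirable response drives the overall score to zero.
    if any(value == 0 for value in d.values()):
        return 0.0
    total_weight = sum(w.values())
    log_sum = sum(w[name] * math.log(d[name]) for name in d)
    return math.exp(log_sum / total_weight)

# Per-response desirabilities and weights as reported in this section.
d = {"msg_rate": 0.9545, "p99_lat": 0.5760}
w = {"msg_rate": 1.5, "p99_lat": 1.0}

D = overall_desirability(d, w)
print(f"D = {D:.4f}")
```

The result agrees with the reported D = 0.7799. Because msg_rate carries weight 1.5, its high desirability pulls the combined score up more than an unweighted geometric mean would.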

Recommended Settings

Factor	Value
`mtu`	3072 bytes
`queue_depth`	288
`rdma_cm`	0.5

Source: from observed run #10

Trade-off Summary

Sacrifice is how much worse the recommended compromise is than the best value observed for that response on its own.

Response	Predicted	Best Observed	Sacrifice
`msg_rate`	161.80	161.80	+0.00
`p99_lat`	2.18	1.85	+0.32

Top 3 Runs by Desirability

Run	D	Factor Settings
#10	0.7799	mtu=3072, queue_depth=288, rdma_cm=0.5
#4	0.7311	mtu=2048, queue_depth=288, rdma_cm=1
#3	0.7200	mtu=3072, queue_depth=512, rdma_cm=1

Model Quality

Response	R²	Type
`msg_rate`	0.3266	linear
`p99_lat`	0.8835	quadratic

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.7799

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
msg_rate                     1.5       0.9545      161.80 Mmsg/s   ↑
p99_lat                      1.0       0.5760        2.18 us   ↓

Recommended settings:
  mtu = 3072 bytes
  queue_depth = 288
  rdma_cm = 0.5
  (from observed run #10)

Trade-off summary:
  msg_rate: 161.80 (best observed: 161.80, sacrifice: +0.00)
  p99_lat: 2.18 (best observed: 1.85, sacrifice: +0.32)

Model quality:
  msg_rate: R² = 0.3266 (linear)
  p99_lat: R² = 0.8835 (quadratic)

Top 3 observed runs by overall desirability:
  1. Run #10 (D=0.7799): mtu=3072, queue_depth=288, rdma_cm=0.5
  2. Run #4 (D=0.7311): mtu=2048, queue_depth=288, rdma_cm=1
  3. Run #3 (D=0.7200): mtu=3072, queue_depth=512, rdma_cm=1