MPI Collective Tuning — DOE Use Case

Summary

This experiment investigates MPI collective tuning. It uses a Plackett-Burman screening design to optimize MPI collective communication performance across 6 factors.

The design varies 6 factors, each at two levels: msg size (4096 or 1048576 bytes), algorithm (ring or recursive_doubling), ppn (16 or 64), eager limit (4096 or 262144 bytes), binding (core or socket), and coll tuning (on or off). The goal is to optimize 2 responses: allreduce latency (us, minimize) and bandwidth (GB/s, maximize). Two conditions are held constant across all runs: nodes = 32 and mpi impl = openmpi.

A Plackett-Burman screening design was used to test 6 factors in 16 runs: 8 distinct factor combinations, each run once in each of 2 blocks. This design assumes interactions are negligible and focuses on identifying the most influential main effects.
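Why the negligible-interactions assumption matters can be checked directly from the design. The sketch below codes the 8 distinct factor combinations of this study as -1/+1 and searches for two-factor interaction columns that coincide, up to sign, with a main-effect column. Which level receives +1 is an arbitrary choice made here for illustration, and the helper names are hypothetical. Where the columns coincide, the design cannot separate the interaction from the main effect.

```python
from itertools import combinations

FACTORS = ["msg_size", "algorithm", "ppn", "eager_limit", "binding", "coll_tuning"]

# Level coded as +1 for each factor (the other level is -1). Arbitrary choice.
HIGH = {"msg_size": "1048576", "algorithm": "recursive_doubling", "ppn": "64",
        "eager_limit": "262144", "binding": "socket", "coll_tuning": "off"}

# The 8 distinct factor combinations (block 1 of the experimental matrix).
RUNS = [
    "1048576 recursive_doubling 64 4096 core on",
    "4096 ring 64 262144 core on",
    "4096 recursive_doubling 16 262144 core off",
    "1048576 recursive_doubling 64 262144 socket off",
    "4096 recursive_doubling 16 4096 socket on",
    "1048576 ring 16 262144 socket on",
    "4096 ring 64 4096 socket off",
    "1048576 ring 16 4096 core off",
]

def coded_columns():
    """Return one -1/+1 column per factor."""
    cols = {f: [] for f in FACTORS}
    for run in RUNS:
        for f, level in zip(FACTORS, run.split()):
            cols[f].append(1 if level == HIGH[f] else -1)
    return cols

def aliases():
    """List (factor_a, factor_b, main_factor) where a*b equals +/- main_factor."""
    cols = coded_columns()
    found = []
    for a, b in combinations(FACTORS, 2):
        product = [x * y for x, y in zip(cols[a], cols[b])]
        for f in FACTORS:
            if f in (a, b):
                continue
            if product == cols[f] or product == [-v for v in cols[f]]:
                found.append((a, b, f))
    return found

for a, b, f in aliases():
    print(f"{a} x {b} is aliased with {f}")
```

For this matrix the search reports 12 of the 15 two-factor interactions as aliased with a main effect. This is consistent with the Interaction Effects tables, where, for example, msg_size × binding has the same magnitude (23.7200) as the eager_limit main effect.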

Key Findings

For allreduce latency, the most influential factors were eager limit (24.8%), coll tuning (23.8%), and binding (17.8%). The best observed value was 34.2 us (at msg size = 4096, algorithm = ring, ppn = 64).

For bandwidth, the most influential factors were algorithm (25.2%), eager limit (24.2%), and msg size (19.4%). The best observed value was 30.14 GB/s (at msg size = 1048576, algorithm = ring, ppn = 16).

Recommended Next Steps

Follow up with a response surface design (CCD or Box-Behnken) on the top 3–4 factors to model curvature and find the true optimum.
Consider whether any fixed factors should be varied in a future study.
The screening results can guide factor reduction — drop factors contributing less than 5% and re-run with a smaller, more focused design.
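
The 5% rule can be sketched as a small filter. The values below are the % Contribution figures from the Main Effects tables in the Full Analysis Output. How to combine two responses is not stated in this report; the sketch assumes a factor is dropped only if it falls below the threshold for every response, and the function names are hypothetical.

```python
# % Contribution values copied from the Main Effects tables.
CONTRIBUTION = {
    "allreduce_latency": {
        "eager_limit": 37.2, "binding": 24.1, "msg_size": 20.7,
        "ppn": 9.3, "coll_tuning": 6.2, "algorithm": 2.6,
    },
    "bandwidth": {
        "eager_limit": 31.1, "ppn": 26.4, "msg_size": 21.6,
        "coll_tuning": 7.6, "binding": 7.1, "algorithm": 6.2,
    },
}
THRESHOLD = 5.0  # percent, from the recommendation above

def below_threshold(contribution, threshold=THRESHOLD):
    """Factors whose contribution to one response is under the threshold."""
    return sorted(f for f, pct in contribution.items() if pct < threshold)

def droppable(contributions, threshold=THRESHOLD):
    """Factors under the threshold for every response (assumed combination rule)."""
    sets = [set(below_threshold(c, threshold)) for c in contributions.values()]
    return sorted(set.intersection(*sets))

for response, contribution in CONTRIBUTION.items():
    print(response, "below threshold:", below_threshold(contribution))
print("droppable for all responses:", droppable(CONTRIBUTION))
```

With these figures, algorithm falls below 5% for allreduce_latency but not for bandwidth, so under the assumed rule no factor is dropped on the basis of both responses.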

Experimental Setup

Factors

Factor	Levels	Type	Unit
`msg_size`	4096, 1048576	continuous	bytes
`algorithm`	ring, recursive_doubling	categorical
`ppn`	16, 64	continuous
`eager_limit`	4096, 262144	continuous	bytes
`binding`	core, socket	categorical
`coll_tuning`	on, off	categorical

Fixed: nodes=32, mpi_impl=openmpi

Responses

Response	Direction	Unit
`allreduce_latency`	↓ minimize	us
`bandwidth`	↑ maximize	GB/s

Experimental Matrix

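The Plackett-Burman design produces 16 runs: runs 1–8 (block 1) cover 8 distinct factor combinations, and runs 9–16 (block 2) repeat the same combinations in a different order. Each row is one experiment with specific factor settings.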

Run	Block	`msg_size`	`algorithm`	`ppn`	`eager_limit`	`binding`	`coll_tuning`
1	1	1048576	recursive_doubling	64	4096	core	on
2	1	4096	ring	64	262144	core	on
3	1	4096	recursive_doubling	16	262144	core	off
4	1	1048576	recursive_doubling	64	262144	socket	off
5	1	4096	recursive_doubling	16	4096	socket	on
6	1	1048576	ring	16	262144	socket	on
7	1	4096	ring	64	4096	socket	off
8	1	1048576	ring	16	4096	core	off
9	2	1048576	recursive_doubling	64	4096	core	on
10	2	1048576	recursive_doubling	64	262144	socket	off
11	2	4096	recursive_doubling	16	4096	socket	on
12	2	4096	ring	64	4096	socket	off
13	2	4096	ring	64	262144	core	on
14	2	4096	recursive_doubling	16	262144	core	off
15	2	1048576	ring	16	262144	socket	on
16	2	1048576	ring	16	4096	core	off
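
The properties that make this matrix usable for main-effect screening can be verified mechanically. The sketch below checks that block 2 repeats the factor combinations of block 1, that every factor appears equally often at each level (balance), and that every pair of factor columns is orthogonal. The function names are hypothetical, and the choice of which level is coded +1 is arbitrary because it does not affect either check.

```python
from itertools import combinations

# Run, block, then the six factor levels, copied from the experimental matrix.
MATRIX = """
1 1 1048576 recursive_doubling 64 4096 core on
2 1 4096 ring 64 262144 core on
3 1 4096 recursive_doubling 16 262144 core off
4 1 1048576 recursive_doubling 64 262144 socket off
5 1 4096 recursive_doubling 16 4096 socket on
6 1 1048576 ring 16 262144 socket on
7 1 4096 ring 64 4096 socket off
8 1 1048576 ring 16 4096 core off
9 2 1048576 recursive_doubling 64 4096 core on
10 2 1048576 recursive_doubling 64 262144 socket off
11 2 4096 recursive_doubling 16 4096 socket on
12 2 4096 ring 64 4096 socket off
13 2 4096 ring 64 262144 core on
14 2 4096 recursive_doubling 16 262144 core off
15 2 1048576 ring 16 262144 socket on
16 2 1048576 ring 16 4096 core off
"""

def parse(matrix):
    """Return {block: [tuple of six factor levels, ...]}."""
    blocks = {}
    for line in matrix.strip().splitlines():
        _run, block, *levels = line.split()
        blocks.setdefault(int(block), []).append(tuple(levels))
    return blocks

def blocks_are_replicates(blocks):
    """True if both blocks contain the same factor combinations."""
    return sorted(blocks[1]) == sorted(blocks[2])

def coded_columns(rows):
    """Code each factor column as -1/+1 (which level gets +1 is arbitrary)."""
    columns = []
    for column in zip(*rows):
        reference = min(column)  # string comparison, only picks a reference level
        columns.append([-1 if level == reference else 1 for level in column])
    return columns

def balanced_and_orthogonal(columns):
    """True if every column sums to 0 and every pair has zero dot product."""
    balanced = all(sum(c) == 0 for c in columns)
    orthogonal = all(sum(x * y for x, y in zip(a, b)) == 0
                     for a, b in combinations(columns, 2))
    return balanced and orthogonal

blocks = parse(MATRIX)
print("block 2 repeats block 1:", blocks_are_replicates(blocks))
print("block 1 balanced and orthogonal:",
      balanced_and_orthogonal(coded_columns(blocks[1])))
```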

How to Run

terminal
$ doe info --config use_cases/07_mpi_collective_tuning/config.json
$ doe generate --config use_cases/07_mpi_collective_tuning/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/07_mpi_collective_tuning/config.json
$ doe optimize --config use_cases/07_mpi_collective_tuning/config.json
$ doe optimize --config use_cases/07_mpi_collective_tuning/config.json --multi  # multi-objective
$ doe report --config use_cases/07_mpi_collective_tuning/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: allreduce_latency

Pareto Chart

Main Effects Plot

Response: bandwidth

Pareto Chart

Main Effects Plot

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.


How to Read These Surfaces

Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.

Flat surface — these two factors have little effect on the response.
Tilted plane — strong linear effect; moving along one axis consistently changes the response.
Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
Saddle shape — significant interaction; the best setting of one factor depends on the other.
Red dots far from surface — poor model fit in that region; be cautious about predictions there.

allreduce_latency (us) — R² = 0.753, Adj R² = 0.753
Moderate fit — surface shows general trends but some noise remains.
Curvature detected in msg_size, algorithm — look for a peak or valley in the surface.
Strongest linear driver: algorithm (decreases allreduce_latency).
Notable interaction: msg_size × coll_tuning — the effect of one depends on the level of the other. Look for a twisted surface.

bandwidth (GB/s) — R² = 0.299, Adj R² = 0.299
Weak fit — interpret the surface shape with caution.
Curvature detected in msg_size, algorithm — look for a peak or valley in the surface.
Strongest linear driver: algorithm (decreases bandwidth).
Notable interaction: msg_size × coll_tuning — the effect of one depends on the level of the other. Look for a twisted surface.

allreduce latency: msg size vs eager limit

allreduce latency: msg size vs ppn

allreduce latency: ppn vs eager limit

bandwidth: msg size vs eager limit

bandwidth: msg size vs ppn

bandwidth: ppn vs eager limit

Full Analysis Output

doe analyze
=== Main Effects: allreduce_latency ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
eager_limit            -23.7200       5.1709            37.2%
binding                 15.3625       5.1709            24.1%
msg_size               -13.1775       5.1709            20.7%
ppn                     -5.9300       5.1709             9.3%
coll_tuning              3.9775       5.1709             6.2%
algorithm               -1.6375       5.1709             2.6%

=== Interaction Effects: allreduce_latency ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
msg_size             binding                  -23.7200            17.8%
algorithm            coll_tuning               23.7200            17.8%
msg_size             eager_limit               15.3625            11.5%
ppn                  coll_tuning              -15.3625            11.5%
algorithm            ppn                      -13.1775             9.9%
eager_limit          binding                  -13.1775             9.9%
msg_size             algorithm                 -5.9300             4.4%
binding              coll_tuning                5.9300             4.4%
algorithm            eager_limit               -3.9775             3.0%
ppn                  binding                   -3.9775             3.0%
msg_size             coll_tuning               -2.0050             1.5%
algorithm            binding                    2.0050             1.5%
ppn                  eager_limit                2.0050             1.5%
msg_size             ppn                       -1.6375             1.2%
eager_limit          coll_tuning                1.6375             1.2%

=== Summary Statistics: allreduce_latency ===

msg_size:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1048576             8    66.1363    23.5214    34.2000    96.6600
  4096                8    52.9588    16.2556    35.5400    84.3300

algorithm:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  recursive_doubling     8    60.3663    21.3574    35.5400    91.5800
  ring                8    58.7287    21.4256    34.2000    96.6600

ppn:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  16                  8    62.5125    24.6412    34.2000    96.6600
  64                  8    56.5825    17.0131    39.3600    91.5800

eager_limit:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  262144              8    71.4075    21.6409    35.5400    96.6600
  4096                8    47.6875    11.2646    34.2000    67.4600

binding:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  core                8    51.8663    16.2748    34.2000    84.3300
  socket              8    67.2288    22.7375    39.6400    96.6600

coll_tuning:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    57.5588    22.0870    34.2000    91.5800
  on                  8    61.5362    20.4902    39.3600    96.6600

=== Main Effects: bandwidth ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
eager_limit             -4.2275       1.7031            31.1%
ppn                     -3.5800       1.7031            26.4%
msg_size                -2.9300       1.7031            21.6%
coll_tuning             -1.0375       1.7031             7.6%
binding                  0.9575       1.7031             7.1%
algorithm                0.8400       1.7031             6.2%

=== Interaction Effects: bandwidth ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
msg_size             coll_tuning                5.5025            12.6%
algorithm            binding                   -5.5025            12.6%
ppn                  eager_limit               -5.5025            12.6%
msg_size             binding                   -4.2275             9.7%
algorithm            coll_tuning                4.2275             9.7%
msg_size             algorithm                 -3.5800             8.2%
binding              coll_tuning                3.5800             8.2%
algorithm            ppn                       -2.9300             6.7%
eager_limit          binding                   -2.9300             6.7%
algorithm            eager_limit                1.0375             2.4%
ppn                  binding                    1.0375             2.4%
msg_size             eager_limit                0.9575             2.2%
ppn                  coll_tuning               -0.9575             2.2%
msg_size             ppn                        0.8400             1.9%
eager_limit          coll_tuning               -0.8400             1.9%

=== Summary Statistics: bandwidth ===

msg_size:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1048576             8    19.9925     6.5195     9.5400    30.1400
  4096                8    17.0625     7.2134     9.2100    29.9100

algorithm:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  recursive_doubling     8    18.1075     5.6671     9.5400    24.5600
  ring                8    18.9475     8.1807     9.2100    30.1400

ppn:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  16                  8    20.3175     5.5034    13.5000    30.1400
  64                  8    16.7375     7.8634     9.2100    29.9100

eager_limit:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  262144              8    20.6412     6.1679     9.2100    29.9100
  4096                8    16.4138     7.1546     9.5400    30.1400

binding:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  core                8    18.0488     8.1889     9.2100    30.1400
  socket              8    19.0062     5.6445    11.2800    24.5600

coll_tuning:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    19.0463     6.5192    11.2800    30.1400
  on                  8    18.0087     7.5052     9.2100    29.9100
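
The Main Effects table for allreduce_latency can be reproduced from the level means in its Summary Statistics. The sketch below assumes two conventions that are inferred from matching the printed numbers, not documented in this report: each effect is the mean at the second listed level minus the mean at the first listed level, and % Contribution is the absolute effect divided by the sum of absolute effects. The function names are hypothetical.

```python
# (mean at first listed level, mean at second listed level), copied from
# "Summary Statistics: allreduce_latency".
LEVEL_MEANS = {
    "msg_size": (66.1363, 52.9588),      # 1048576, 4096
    "algorithm": (60.3663, 58.7287),     # recursive_doubling, ring
    "ppn": (62.5125, 56.5825),           # 16, 64
    "eager_limit": (71.4075, 47.6875),   # 262144, 4096
    "binding": (51.8663, 67.2288),       # core, socket
    "coll_tuning": (57.5588, 61.5362),   # off, on
}

def main_effects(level_means):
    """Effect = mean(second listed level) - mean(first listed level)."""
    return {f: second - first for f, (first, second) in level_means.items()}

def percent_contribution(effects):
    """Share of each |effect| in the sum of |effects|, in percent."""
    total = sum(abs(e) for e in effects.values())
    return {f: 100 * abs(e) / total for f, e in effects.items()}

effects = main_effects(LEVEL_MEANS)
shares = percent_contribution(effects)
for factor in sorted(effects, key=lambda f: -abs(effects[f])):
    print(f"{factor:<12} {effects[factor]:>9.4f} {shares[factor]:>6.1f}%")
```

Under this convention the second listed level of msg_size and eager_limit is the smaller value (4096), so their negative effects mean that latency was lower at 4096 than at the larger setting.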

Optimization Recommendations

doe optimize
=== Optimization: allreduce_latency ===
Direction: minimize

Best observed run: #2
  msg_size = 1048576
  algorithm = recursive_doubling
  ppn = 64
  eager_limit = 4096
  binding = core
  coll_tuning = on
  Value: 34.2

RSM Model (linear, R² = 0.27):
  Coefficients:
    intercept:  +59.5475
    msg_size:  -2.3600
    algorithm:  -2.4125
    ppn:  +4.3000
    eager_limit:  -4.6425
    binding:  +6.7700
    coll_tuning:  +3.4600
  Predicted optimum:
    msg_size = 4096
    algorithm = recursive_doubling
    ppn = 16
    eager_limit = 4096
    binding = socket
    coll_tuning = on
    Predicted value: 74.8925

Factor importance:
  1. binding  (effect: 13.5, contribution: 28.3%)
  2. eager_limit  (effect: 9.3, contribution: 19.4%)
  3. ppn  (effect: 8.6, contribution: 18.0%)
  4. coll_tuning  (effect: 6.9, contribution: 14.4%)
  5. algorithm  (effect: -4.8, contribution: 10.1%)
  6. msg_size  (effect: 4.7, contribution: 9.9%)

=== Optimization: bandwidth ===
Direction: maximize

Best observed run: #15
  msg_size = 1048576
  algorithm = ring
  ppn = 16
  eager_limit = 262144
  binding = socket
  coll_tuning = on
  Value: 30.14

RSM Model (linear, R² = 0.21):
  Coefficients:
    intercept:  +18.5275
    msg_size:  +0.1463
    algorithm:  +0.9300
    ppn:  -1.3213
    eager_limit:  -1.9263
    binding:  +0.9800
    coll_tuning:  +1.3487
  Predicted optimum:
    msg_size = 4096
    algorithm = recursive_doubling
    ppn = 16
    eager_limit = 4096
    binding = socket
    coll_tuning = on
    Predicted value: 23.0275

Factor importance:
  1. eager_limit  (effect: 3.9, contribution: 29.0%)
  2. coll_tuning  (effect: 2.7, contribution: 20.3%)
  3. ppn  (effect: -2.6, contribution: 19.9%)
  4. binding  (effect: 2.0, contribution: 14.7%)
  5. algorithm  (effect: 1.9, contribution: 14.0%)
  6. msg_size  (effect: -0.3, contribution: 2.2%)
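
The "Predicted value" lines above can be reproduced by evaluating each linear model at the coded settings of the reported predicted optimum. The -1/+1 coding used below is an assumption: it was chosen because it reproduces the reported values, and the coding actually used by the tool is not documented in this report. The function name is hypothetical.

```python
# Coefficients copied from the two "RSM Model (linear)" blocks.
LATENCY_MODEL = {
    "intercept": 59.5475, "msg_size": -2.3600, "algorithm": -2.4125,
    "ppn": 4.3000, "eager_limit": -4.6425, "binding": 6.7700,
    "coll_tuning": 3.4600,
}
BANDWIDTH_MODEL = {
    "intercept": 18.5275, "msg_size": 0.1463, "algorithm": 0.9300,
    "ppn": -1.3213, "eager_limit": -1.9263, "binding": 0.9800,
    "coll_tuning": 1.3487,
}

# Reported predicted optimum (identical for both responses), in assumed
# coded units: msg_size=4096 -> -1, algorithm=recursive_doubling -> -1,
# ppn=16 -> -1, eager_limit=4096 -> -1, binding=socket -> +1, coll_tuning=on -> +1.
CODED_SETTINGS = {
    "msg_size": -1, "algorithm": -1, "ppn": -1,
    "eager_limit": -1, "binding": 1, "coll_tuning": 1,
}

def predict(model, coded):
    """Evaluate intercept + sum(coefficient * coded level)."""
    return model["intercept"] + sum(model[f] * x for f, x in coded.items())

print(f"allreduce_latency: {predict(LATENCY_MODEL, CODED_SETTINGS):.4f}")
print(f"bandwidth:         {predict(BANDWIDTH_MODEL, CODED_SETTINGS):.4f}")
```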

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
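
The combination step can be checked with a few lines. The sketch below takes the per-response desirabilities and weights reported in this section and computes the weighted geometric mean D = (d1^w1 · d2^w2)^(1/(w1 + w2)). It does not reproduce how each individual desirability is scaled from the raw response, since the scaling bounds are not given here. The function name is hypothetical.

```python
import math

# Per-response desirabilities and weights as reported in this section.
DESIRABILITY = {"allreduce_latency": 0.6288, "bandwidth": 0.9446}
WEIGHT = {"allreduce_latency": 1.0, "bandwidth": 1.5}

def overall_desirability(desirability, weight):
    """Weighted geometric mean of per-response desirabilities."""
    total_weight = sum(weight[name] for name in desirability)
    product = math.prod(d ** weight[name] for name, d in desirability.items())
    return product ** (1.0 / total_weight)

D = overall_desirability(DESIRABILITY, WEIGHT)
print(f"D = {D:.4f}")
```

Because the mean is geometric, a desirability of 0 for any single response drives the overall D to 0, so a setting that is unacceptable on one response cannot be rescued by the other.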

Overall Desirability

D = 0.8027

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`allreduce_latency`	1.0	0.6288	56.58 us	↓
`bandwidth`	1.5	0.9446	29.91 GB/s	↑

Recommended Settings

Factor	Value
`msg_size`	1048576 bytes
`algorithm`	recursive_doubling
`ppn`	64
`eager_limit`	4096 bytes
`binding`	core
`coll_tuning`	on

Source: from observed run #6

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response	Predicted	Best Observed	Sacrifice
`allreduce_latency`	56.58	34.20	+22.38
`bandwidth`	29.91	30.14	+0.23

Top 3 Runs by Desirability

Run	D	Factor Settings
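#6	0.8027	msg_size=1048576, algorithm=recursive_doubling, ppn=64, eager_limit=4096, binding=core, coll_tuning=on
#15	0.8024	msg_size=4096, algorithm=recursive_doubling, ppn=16, eager_limit=4096, binding=socket, coll_tuning=on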
#16	0.6033	msg_size=4096, algorithm=recursive_doubling, ppn=16, eager_limit=262144, binding=core, coll_tuning=off

Model Quality

Response	R²	Type
`allreduce_latency`	0.2755	linear
`bandwidth`	0.3071	linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.8027

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
allreduce_latency            1.0       0.6288       56.58 us   ↓
bandwidth                    1.5       0.9446       29.91 GB/s   ↑

Recommended settings:
  msg_size = 1048576 bytes
  algorithm = recursive_doubling
  ppn = 64
  eager_limit = 4096 bytes
  binding = core
  coll_tuning = on
  (from observed run #6)

Trade-off summary:
  allreduce_latency: 56.58 (best observed: 34.20, sacrifice: +22.38)
  bandwidth: 29.91 (best observed: 30.14, sacrifice: +0.23)

Model quality:
  allreduce_latency: R² = 0.2755 (linear)
  bandwidth: R² = 0.3071 (linear)

Top 3 observed runs by overall desirability:
  1. Run #6 (D=0.8027): msg_size=1048576, algorithm=recursive_doubling, ppn=64, eager_limit=4096, binding=core, coll_tuning=on
  2. Run #15 (D=0.8024): msg_size=4096, algorithm=recursive_doubling, ppn=16, eager_limit=4096, binding=socket, coll_tuning=on
  3. Run #16 (D=0.6033): msg_size=4096, algorithm=recursive_doubling, ppn=16, eager_limit=262144, binding=core, coll_tuning=off