Plackett-Burman

MPI Collective Tuning

Tune MPI collective communication parameters for large-scale HPC clusters.

Summary

This experiment uses a Plackett-Burman screening design to find which of 6 factors most affect MPI collective communication performance.

The design varies 6 factors at two levels each: msg size (4096 or 1048576 bytes), algorithm (ring or recursive_doubling), ppn (16 or 64), eager limit (4096 or 262144 bytes), binding (core or socket), and coll tuning (on or off). The goal is to optimize 2 responses: allreduce latency in us (minimize) and bandwidth in GB/s (maximize). Fixed conditions held constant across all runs are nodes = 32 and mpi impl = openmpi.

A Plackett-Burman screening design was used to test 6 factors in 16 runs: 8 distinct factor combinations, each run once in each of 2 blocks. This design assumes interactions are negligible and focuses on identifying the most influential main effects.
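As a rough illustration of how an 8-run, two-level Plackett-Burman matrix can be built, the sketch below starts from the standard N = 8 generator row and applies cyclic shifts, then appends a row of all low levels. The function name `plackett_burman_8` and the -1/+1 coding are assumptions made for this example; the `doe` tool may construct and order its runs differently.

```python
from itertools import combinations

# Standard first row for the N = 8 Plackett-Burman design (+1 = high, -1 = low).
GENERATOR = [1, 1, 1, -1, 1, -1, -1]


def plackett_burman_8(n_factors):
    """Return an 8-run coded design for up to 7 two-level factors."""
    if not 1 <= n_factors <= 7:
        raise ValueError("an 8-run design supports 1 to 7 factors")
    k = len(GENERATOR)
    # Row i is the generator cyclically shifted right by i positions.
    rows = [[GENERATOR[(j - i) % k] for j in range(k)] for i in range(k)]
    # The final run sets every factor to its low level.
    rows.append([-1] * k)
    # Keep only as many columns as there are factors.
    return [row[:n_factors] for row in rows]


design = plackett_burman_8(6)
columns = list(zip(*design))

# Each factor appears at its high level in exactly half of the runs.
balanced = all(sum(col) == 0 for col in columns)
# Every pair of factor columns is orthogonal, so main effects are
# estimated independently of one another.
orthogonal = all(
    sum(a * b for a, b in zip(c1, c2)) == 0
    for c1, c2 in combinations(columns, 2)
)

for run in design:
    print(" ".join(f"{v:+d}" for v in run))
print("balanced:", balanced, "orthogonal:", orthogonal)
```

Mapping -1 and +1 to the two levels of each factor (for example 4096 and 1048576 for msg size) turns the coded rows into concrete run settings.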

Key Findings

For allreduce latency, the most influential factors were eager limit (24.8%), coll tuning (23.8%), and binding (17.8%). The best observed value was 34.2 us (at msg size = 4096, algorithm = ring, ppn = 64).

For bandwidth, the most influential factors were algorithm (25.2%), eager limit (24.2%), and msg size (19.4%). The best observed value was 30.14 GB/s (at msg size = 1048576, algorithm = ring, ppn = 16).

Recommended Next Steps

Experimental Setup

Factors

Factor       Levels                    Type         Unit
msg_size     4096, 1048576             continuous   bytes
algorithm    ring, recursive_doubling  categorical
ppn          16, 64                    continuous
eager_limit  4096, 262144              continuous   bytes
binding      core, socket              categorical
coll_tuning  on, off                   categorical

Fixed: nodes=32, mpi_impl=openmpi

Responses

Response           Direction   Unit
allreduce_latency  ↓ minimize  us
bandwidth          ↑ maximize  GB/s

Experimental Matrix

The Plackett-Burman design produces 16 runs across 2 blocks. Each row is one experiment with specific factor settings.

Run  Block  msg_size  algorithm           ppn  eager_limit  binding  coll_tuning
1    1      1048576   recursive_doubling  64   4096         core     on
2    1      4096      ring                64   262144       core     on
3    1      4096      recursive_doubling  16   262144       core     off
4    1      1048576   recursive_doubling  64   262144       socket   off
5    1      4096      recursive_doubling  16   4096         socket   on
6    1      1048576   ring                16   262144       socket   on
7    1      4096      ring                64   4096         socket   off
8    1      1048576   ring                16   4096         core     off
9    2      1048576   recursive_doubling  64   4096         core     on
10   2      1048576   recursive_doubling  64   262144       socket   off
11   2      4096      recursive_doubling  16   4096         socket   on
12   2      4096      ring                64   4096         socket   off
13   2      4096      ring                64   262144       core     on
14   2      4096      recursive_doubling  16   262144       core     off
15   2      1048576   ring                16   262144       socket   on
16   2      1048576   ring                16   4096         core     off
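The properties that make this matrix usable for screening can be checked directly. The sketch below copies the 16 runs from the matrix, codes each factor as -1/+1 (the choice of which level counts as +1 is an assumption made here and only affects signs), and tests three things: that block 2 repeats the combinations of block 1, that every factor is balanced, and that every pair of factor columns is orthogonal. It also looks for two-factor interaction columns that coincide with a main-effect column, which is why the design has to assume interactions are negligible.

```python
from itertools import combinations

FACTORS = ["msg_size", "algorithm", "ppn", "eager_limit", "binding", "coll_tuning"]
# Assumed coding: these levels map to +1, the other level of each factor to -1.
HIGH = (1048576, "recursive_doubling", 64, 262144, "socket", "on")

# Runs 1-16 copied from the experimental matrix (block 1 first, then block 2).
RUNS = [
    (1048576, "recursive_doubling", 64, 4096, "core", "on"),
    (4096, "ring", 64, 262144, "core", "on"),
    (4096, "recursive_doubling", 16, 262144, "core", "off"),
    (1048576, "recursive_doubling", 64, 262144, "socket", "off"),
    (4096, "recursive_doubling", 16, 4096, "socket", "on"),
    (1048576, "ring", 16, 262144, "socket", "on"),
    (4096, "ring", 64, 4096, "socket", "off"),
    (1048576, "ring", 16, 4096, "core", "off"),
    (1048576, "recursive_doubling", 64, 4096, "core", "on"),
    (1048576, "recursive_doubling", 64, 262144, "socket", "off"),
    (4096, "recursive_doubling", 16, 4096, "socket", "on"),
    (4096, "ring", 64, 4096, "socket", "off"),
    (4096, "ring", 64, 262144, "core", "on"),
    (4096, "recursive_doubling", 16, 262144, "core", "off"),
    (1048576, "ring", 16, 262144, "socket", "on"),
    (1048576, "ring", 16, 4096, "core", "off"),
]

coded = [tuple(1 if v == h else -1 for v, h in zip(run, HIGH)) for run in RUNS]
block1, block2 = coded[:8], coded[8:]
columns = {f: [row[i] for row in block1] for i, f in enumerate(FACTORS)}

# Block 2 contains the same 8 combinations as block 1, in a different order.
replicated = sorted(block1) == sorted(block2)
# Each factor sits at each level in half of the runs.
balanced = all(sum(col) == 0 for col in columns.values())
# Every pair of factor columns has zero dot product.
orthogonal = all(
    sum(a * b for a, b in zip(columns[f], columns[g])) == 0
    for f, g in combinations(FACTORS, 2)
)

# An interaction column is the elementwise product of two factor columns.
# If it equals (+/-) a main-effect column, the two effects cannot be separated.
aliases = {}
for f, g in combinations(FACTORS, 2):
    product = [a * b for a, b in zip(columns[f], columns[g])]
    for h, col in columns.items():
        if product == col:
            aliases[(f, g)] = (h, 1)
        elif product == [-c for c in col]:
            aliases[(f, g)] = (h, -1)

print("replicated:", replicated, "balanced:", balanced, "orthogonal:", orthogonal)
for (f, g), (h, sign) in aliases.items():
    print(f"{f} x {g} is aliased with {'+' if sign > 0 else '-'}{h}")
```

Under this coding the msg_size × binding column is identical to the eager_limit column, so an apparent eager_limit effect could equally be a msg_size × binding interaction.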

How to Run

terminal
$ doe info --config use_cases/07_mpi_collective_tuning/config.json
$ doe generate --config use_cases/07_mpi_collective_tuning/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/07_mpi_collective_tuning/config.json
$ doe optimize --config use_cases/07_mpi_collective_tuning/config.json
$ doe optimize --config use_cases/07_mpi_collective_tuning/config.json --multi # multi-objective
$ doe report --config use_cases/07_mpi_collective_tuning/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: allreduce_latency

Pareto Chart

Pareto chart for allreduce_latency

Main Effects Plot

Main effects plot for allreduce_latency

Response: bandwidth

Pareto Chart

Pareto chart for bandwidth

Main Effects Plot

Main effects plot for bandwidth

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

How to Read These Surfaces

Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.

  • Flat surface — these two factors have little effect on the response.
  • Tilted plane — strong linear effect; moving along one axis consistently changes the response.
  • Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
  • Saddle shape — significant interaction; the best setting of one factor depends on the other.
  • Red dots far from surface — poor model fit in that region; be cautious about predictions there.
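The shape categories above can be tied to the coefficients of a two-factor quadratic model z = b0 + b1·x + b2·y + b11·x² + b22·y² + b12·x·y. The sketch below is a simplified classifier based on the sign of the determinant of the quadratic part's Hessian; the function name `classify_surface`, the tolerance, and the example coefficients are hypothetical, and this is not the logic the `doe` tool uses to write its plot annotations.

```python
def classify_surface(b1, b2, b11, b22, b12, tol=1e-9):
    """Classify z = b0 + b1*x + b2*y + b11*x^2 + b22*y^2 + b12*x*y by shape."""
    has_quadratic = max(abs(b11), abs(b22), abs(b12)) > tol
    if not has_quadratic:
        # No curvature or interaction: the surface is a plane.
        if max(abs(b1), abs(b2)) <= tol:
            return "flat"
        return "tilted plane"
    # Hessian of the model is [[2*b11, b12], [b12, 2*b22]].
    det = 4.0 * b11 * b22 - b12 ** 2
    if det > tol:
        # Both curvatures share a sign: a single peak or valley.
        return "dome" if b11 < 0 else "bowl"
    if det < -tol:
        # Curvatures of opposite sign, or a dominant interaction term.
        return "saddle"
    # Curvature along one direction only.
    return "ridge"


# Hypothetical coefficient sets, one per shape.
examples = {
    "no effect": (0.0, 0.0, 0.0, 0.0, 0.0),
    "linear only": (3.0, 0.5, 0.0, 0.0, 0.0),
    "interior maximum": (0.0, 0.0, -2.0, -1.0, 0.5),
    "pure interaction": (0.0, 0.0, 0.0, 0.0, 1.5),
}
for label, coeffs in examples.items():
    print(f"{label}: {classify_surface(*coeffs)}")
```

A pure interaction term (b12 with no squared terms) always gives a negative determinant, which matches the reading of a saddle as a sign of interaction.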

allreduce_latency (us) — R² = 0.753, Adj R² = 0.753
Moderate fit — surface shows general trends but some noise remains.
Curvature detected in msg_size, algorithm — look for a peak or valley in the surface.
Strongest linear driver: algorithm (decreases allreduce_latency).
Notable interaction: msg_size × coll_tuning — the effect of one depends on the level of the other. Look for a twisted surface.

bandwidth (GB/s) — R² = 0.299, Adj R² = 0.299
Weak fit — interpret the surface shape with caution.
Curvature detected in msg_size, algorithm — look for a peak or valley in the surface.
Strongest linear driver: algorithm (decreases bandwidth).
Notable interaction: msg_size × coll_tuning — the effect of one depends on the level of the other. Look for a twisted surface.

RSM surface: allreduce_latency, msg_size vs eager_limit

RSM surface: allreduce_latency, msg_size vs ppn

RSM surface: allreduce_latency, ppn vs eager_limit

RSM surface: bandwidth, msg_size vs eager_limit

RSM surface: bandwidth, msg_size vs ppn

RSM surface: bandwidth, ppn vs eager_limit

Full Analysis Output

doe analyze
=== Main Effects: allreduce_latency ===
Factor          Effect      Std Error   % Contribution
--------------------------------------------------------------
eager_limit     -23.7200    5.1709      37.2%
binding          15.3625    5.1709      24.1%
msg_size        -13.1775    5.1709      20.7%
ppn              -5.9300    5.1709       9.3%
coll_tuning       3.9775    5.1709       6.2%
algorithm        -1.6375    5.1709       2.6%

=== Interaction Effects: allreduce_latency ===
Factor A        Factor B        Interaction   % Contribution
------------------------------------------------------------------------
msg_size        binding         -23.7200      17.8%
algorithm       coll_tuning      23.7200      17.8%
msg_size        eager_limit      15.3625      11.5%
ppn             coll_tuning     -15.3625      11.5%
algorithm       ppn             -13.1775       9.9%
eager_limit     binding         -13.1775       9.9%
msg_size        algorithm        -5.9300       4.4%
binding         coll_tuning       5.9300       4.4%
algorithm       eager_limit      -3.9775       3.0%
ppn             binding          -3.9775       3.0%
msg_size        coll_tuning      -2.0050       1.5%
algorithm       binding           2.0050       1.5%
ppn             eager_limit       2.0050       1.5%
msg_size        ppn              -1.6375       1.2%
eager_limit     coll_tuning       1.6375       1.2%

=== Summary Statistics: allreduce_latency ===

msg_size:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
1048576              8    66.1363   23.5214   34.2000   96.6600
4096                 8    52.9588   16.2556   35.5400   84.3300

algorithm:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
recursive_doubling   8    60.3663   21.3574   35.5400   91.5800
ring                 8    58.7287   21.4256   34.2000   96.6600

ppn:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
16                   8    62.5125   24.6412   34.2000   96.6600
64                   8    56.5825   17.0131   39.3600   91.5800

eager_limit:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
262144               8    71.4075   21.6409   35.5400   96.6600
4096                 8    47.6875   11.2646   34.2000   67.4600

binding:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
core                 8    51.8663   16.2748   34.2000   84.3300
socket               8    67.2288   22.7375   39.6400   96.6600

coll_tuning:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
off                  8    57.5588   22.0870   34.2000   91.5800
on                   8    61.5362   20.4902   39.3600   96.6600

=== Main Effects: bandwidth ===
Factor          Effect      Std Error   % Contribution
--------------------------------------------------------------
eager_limit      -4.2275    1.7031      31.1%
ppn              -3.5800    1.7031      26.4%
msg_size         -2.9300    1.7031      21.6%
coll_tuning      -1.0375    1.7031       7.6%
binding           0.9575    1.7031       7.1%
algorithm         0.8400    1.7031       6.2%

=== Interaction Effects: bandwidth ===
Factor A        Factor B        Interaction   % Contribution
------------------------------------------------------------------------
msg_size        coll_tuning       5.5025      12.6%
algorithm       binding          -5.5025      12.6%
ppn             eager_limit      -5.5025      12.6%
msg_size        binding          -4.2275       9.7%
algorithm       coll_tuning       4.2275       9.7%
msg_size        algorithm        -3.5800       8.2%
binding         coll_tuning       3.5800       8.2%
algorithm       ppn              -2.9300       6.7%
eager_limit     binding          -2.9300       6.7%
algorithm       eager_limit       1.0375       2.4%
ppn             binding           1.0375       2.4%
msg_size        eager_limit       0.9575       2.2%
ppn             coll_tuning      -0.9575       2.2%
msg_size        ppn               0.8400       1.9%
eager_limit     coll_tuning      -0.8400       1.9%

=== Summary Statistics: bandwidth ===

msg_size:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
1048576              8    19.9925   6.5195    9.5400    30.1400
4096                 8    17.0625   7.2134    9.2100    29.9100

algorithm:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
recursive_doubling   8    18.1075   5.6671    9.5400    24.5600
ring                 8    18.9475   8.1807    9.2100    30.1400

ppn:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
16                   8    20.3175   5.5034    13.5000   30.1400
64                   8    16.7375   7.8634    9.2100    29.9100

eager_limit:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
262144               8    20.6412   6.1679    9.2100    29.9100
4096                 8    16.4138   7.1546    9.5400    30.1400

binding:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
core                 8    18.0488   8.1889    9.2100    30.1400
socket               8    19.0062   5.6445    11.2800   24.5600

coll_tuning:
Level                N    Mean      Std       Min       Max
------------------------------------------------------------
off                  8    19.0463   6.5192    11.2800   30.1400
on                   8    18.0087   7.5052    9.2100    29.9100
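The main-effect and contribution figures for allreduce_latency can be reproduced from the per-level means in the summary statistics. The sketch below assumes, based only on the numbers in this output and not on the tool's source, that an effect is the mean at the second-listed level minus the mean at the first-listed level, and that % contribution is each absolute effect's share of the summed absolute effects.

```python
# (first-listed level mean, second-listed level mean) in us,
# copied from "Summary Statistics: allreduce_latency".
LEVEL_MEANS = {
    "msg_size": (66.1363, 52.9588),     # 1048576, 4096
    "algorithm": (60.3663, 58.7287),    # recursive_doubling, ring
    "ppn": (62.5125, 56.5825),          # 16, 64
    "eager_limit": (71.4075, 47.6875),  # 262144, 4096
    "binding": (51.8663, 67.2288),      # core, socket
    "coll_tuning": (57.5588, 61.5362),  # off, on
}


def main_effects(level_means):
    """Effect = mean at second-listed level minus mean at first-listed level."""
    return {f: second - first for f, (first, second) in level_means.items()}


def percent_contribution(effects):
    """Share of each absolute effect in the sum of absolute effects."""
    total = sum(abs(e) for e in effects.values())
    return {f: 100.0 * abs(e) / total for f, e in effects.items()}


effects = main_effects(LEVEL_MEANS)
contrib = percent_contribution(effects)

# Print factors from largest to smallest absolute effect.
for f in sorted(effects, key=lambda name: -abs(effects[name])):
    print(f"{f:<12} {effects[f]:>9.4f} {contrib[f]:>5.1f}%")
```

The signs depend on the order in which levels are listed, so a negative effect here means the second-listed level gave the lower mean, not necessarily the numerically higher setting.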

Optimization Recommendations

doe optimize
=== Optimization: allreduce_latency ===
Direction: minimize
Best observed run: #2
  msg_size = 1048576
  algorithm = recursive_doubling
  ppn = 64
  eager_limit = 4096
  binding = core
  coll_tuning = on
  Value: 34.2

RSM Model (linear, R² = 0.27):
Coefficients:
  intercept: +59.5475
  msg_size: -2.3600
  algorithm: -2.4125
  ppn: +4.3000
  eager_limit: -4.6425
  binding: +6.7700
  coll_tuning: +3.4600

Predicted optimum:
  msg_size = 4096
  algorithm = recursive_doubling
  ppn = 16
  eager_limit = 4096
  binding = socket
  coll_tuning = on
  Predicted value: 74.8925

Factor importance:
  1. binding (effect: 13.5, contribution: 28.3%)
  2. eager_limit (effect: 9.3, contribution: 19.4%)
  3. ppn (effect: 8.6, contribution: 18.0%)
  4. coll_tuning (effect: 6.9, contribution: 14.4%)
  5. algorithm (effect: -4.8, contribution: 10.1%)
  6. msg_size (effect: 4.7, contribution: 9.9%)

=== Optimization: bandwidth ===
Direction: maximize
Best observed run: #15
  msg_size = 1048576
  algorithm = ring
  ppn = 16
  eager_limit = 262144
  binding = socket
  coll_tuning = on
  Value: 30.14

RSM Model (linear, R² = 0.21):
Coefficients:
  intercept: +18.5275
  msg_size: +0.1463
  algorithm: +0.9300
  ppn: -1.3213
  eager_limit: -1.9263
  binding: +0.9800
  coll_tuning: +1.3487

Predicted optimum:
  msg_size = 4096
  algorithm = recursive_doubling
  ppn = 16
  eager_limit = 4096
  binding = socket
  coll_tuning = on
  Predicted value: 23.0275

Factor importance:
  1. eager_limit (effect: 3.9, contribution: 29.0%)
  2. coll_tuning (effect: 2.7, contribution: 20.3%)
  3. ppn (effect: -2.6, contribution: 19.9%)
  4. binding (effect: 2.0, contribution: 14.7%)
  5. algorithm (effect: 1.9, contribution: 14.0%)
  6. msg_size (effect: -0.3, contribution: 2.2%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.8027

Per-Response Desirability

Response           Weight  Desirability  Predicted   Dir
allreduce_latency  1.0     0.6288        56.58 us    ↓
bandwidth          1.5     0.9446        29.91 GB/s  ↑
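The overall desirability can be recomputed from the per-response values as a weighted geometric mean. The sketch below takes the reported desirabilities (0.6288 and 0.9446) and weights (1.0 and 1.5) as given; it does not reproduce how each response was scaled to 0 to 1, since the bounds used for that step are not shown here. The function name `overall_desirability` is an assumption for this example.

```python
import math


def overall_desirability(desirabilities, weights):
    """Weighted geometric mean: D = (prod d_i ** w_i) ** (1 / sum w_i)."""
    if set(desirabilities) != set(weights):
        raise ValueError("desirabilities and weights must cover the same responses")
    # A single fully unacceptable response drives the overall score to zero.
    if any(d <= 0.0 for d in desirabilities.values()):
        return 0.0
    total_weight = sum(weights.values())
    # Work in log space to avoid underflow with many responses.
    log_sum = sum(weights[r] * math.log(desirabilities[r]) for r in desirabilities)
    return math.exp(log_sum / total_weight)


# Values copied from the per-response desirability table.
d = {"allreduce_latency": 0.6288, "bandwidth": 0.9446}
w = {"allreduce_latency": 1.0, "bandwidth": 1.5}

D = overall_desirability(d, w)
print(f"D = {D:.4f}")
```

Because bandwidth carries the larger weight, its high desirability pulls D well above the latency desirability of 0.6288.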

Recommended Settings

Factor       Value
msg_size     1048576 bytes
algorithm    recursive_doubling
ppn          64
eager_limit  4096 bytes
binding      core
coll_tuning  on

Source: from observed run #6

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response           Predicted  Best Observed  Sacrifice
allreduce_latency  56.58      34.20          +22.38
bandwidth          29.91      30.14          +0.23

Top 3 Runs by Desirability

Run  D       Factor Settings
#6   0.8027  msg_size=1048576, algorithm=recursive_doubling, ppn=64, eager_limit=4096, binding=core, coll_tuning=on
#15  0.8024  msg_size=4096, algorithm=recursive_doubling, ppn=16, eager_limit=4096, binding=socket, coll_tuning=on
#16  0.6033  msg_size=4096, algorithm=recursive_doubling, ppn=16, eager_limit=262144, binding=core, coll_tuning=off

Model Quality

Response           R²      Type
allreduce_latency  0.2755  linear
bandwidth          0.3071  linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.8027

Response             Weight  Desirability  Predicted    Direction
---------------------------------------------------------------------
allreduce_latency    1.0     0.6288        56.58 us     ↓
bandwidth            1.5     0.9446        29.91 GB/s   ↑

Recommended settings:
  msg_size = 1048576 bytes
  algorithm = recursive_doubling
  ppn = 64
  eager_limit = 4096 bytes
  binding = core
  coll_tuning = on
  (from observed run #6)

Trade-off summary:
  allreduce_latency: 56.58 (best observed: 34.20, sacrifice: +22.38)
  bandwidth: 29.91 (best observed: 30.14, sacrifice: +0.23)

Model quality:
  allreduce_latency: R² = 0.2755 (linear)
  bandwidth: R² = 0.3071 (linear)

Top 3 observed runs by overall desirability:
  1. Run #6 (D=0.8027): msg_size=1048576, algorithm=recursive_doubling, ppn=64, eager_limit=4096, binding=core, coll_tuning=on
  2. Run #15 (D=0.8024): msg_size=4096, algorithm=recursive_doubling, ppn=16, eager_limit=4096, binding=socket, coll_tuning=on
  3. Run #16 (D=0.6033): msg_size=4096, algorithm=recursive_doubling, ppn=16, eager_limit=262144, binding=core, coll_tuning=off