← All Use Cases
🕸
Plackett-Burman

Interconnect Topology & Adaptive Routing

Identify dominant factors in dragonfly network routing performance for HPC clusters.

Summary

This experiment investigates interconnect topology & adaptive routing. Plackett-Burman screening design to identify dominant factors in dragonfly network routing performance.

The design varies 5 factors: routing mode, ranging from minimal to adaptive, vc count, ranging from 2 to 8, traffic pattern, ranging from nearest_neighbor to alltoall, link bandwidth (Gbps), ranging from 100 to 200, and job placement, ranging from compact to scatter. The goal is to optimize 2 responses: msg latency us (us) (minimize) and throughput GBs (GB/s) (maximize). Fixed conditions held constant across all runs include topology = dragonfly, groups = 16, switches per group = 48, nodes = 3072.

A Plackett-Burman screening design was used to efficiently test 5 factors in only 16 runs. This design assumes interactions are negligible and focuses on identifying the most influential main effects.

Key Findings

For msg latency us, the most influential factors were job placement (38.4%), routing mode (27.4%), traffic pattern (23.5%). The best observed value was 2.53 (at routing mode = adaptive, vc count = 8, traffic pattern = alltoall).

For throughput GBs, the most influential factors were traffic pattern (44.4%), job placement (28.4%), routing mode (14.4%). The best observed value was 1104.56 (at routing mode = adaptive, vc count = 2, traffic pattern = nearest_neighbor).

Recommended Next Steps

Experimental Setup

Factors

FactorLevelsTypeUnit
routing_modeminimal, adaptivecategorical
vc_count2, 8continuous
traffic_patternnearest_neighbor, alltoallcategorical
link_bandwidth100, 200continuousGbps
job_placementcompact, scattercategorical

Fixed: topology = dragonfly, groups = 16, switches_per_group = 48, nodes = 3072

Responses

ResponseDirectionUnit
msg_latency_us↓ minimizeus
throughput_GBs↑ maximizeGB/s

Experimental Matrix

The Plackett-Burman Design produces 16 runs. Each row is one experiment with specific factor settings.

RunBlockrouting_modevc_counttraffic_patternlink_bandwidthjob_placement
11adaptive8alltoall100compact
21minimal2alltoall200compact
31minimal8nearest_neighbor200compact
41adaptive8alltoall200scatter
51minimal8nearest_neighbor100scatter
61adaptive2nearest_neighbor200scatter
71minimal2alltoall100scatter
81adaptive2nearest_neighbor100compact
92adaptive8alltoall100compact
102adaptive8alltoall200scatter
112minimal8nearest_neighbor100scatter
122minimal2alltoall100scatter
132minimal2alltoall200compact
142minimal8nearest_neighbor200compact
152adaptive2nearest_neighbor200scatter
162adaptive2nearest_neighbor100compact

How to Run

terminal
$ doe info --config use_cases/18_interconnect_topology_routing/config.json $ doe generate --config use_cases/18_interconnect_topology_routing/config.json --output results/run.sh --seed 42 $ bash results/run.sh $ doe analyze --config use_cases/18_interconnect_topology_routing/config.json $ doe optimize --config use_cases/18_interconnect_topology_routing/config.json $ doe optimize --config use_cases/18_interconnect_topology_routing/config.json --multi # multi-objective $ doe report --config use_cases/18_interconnect_topology_routing/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: msg_latency_us

Pareto Chart

Pareto chart for msg_latency_us

Main Effects Plot

Main effects plot for msg_latency_us

Response: throughput_GBs

Pareto Chart

Pareto chart for throughput_GBs

Main Effects Plot

Main effects plot for throughput_GBs

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

msg_latency_us: vc_count vs link_bandwidth

RSM surface: msg_latency_us: vc_count vs link_bandwidth

throughput_GBs: vc_count vs link_bandwidth

RSM surface: throughput_GBs: vc_count vs link_bandwidth

Full Analysis Output

doe analyze
=== Main Effects: msg_latency_us === Factor Effect Std Error % Contribution -------------------------------------------------------------- link_bandwidth 1.4000 0.5394 53.9% vc_count -0.4225 0.5394 16.3% traffic_pattern 0.2725 0.5394 10.5% routing_mode 0.2625 0.5394 10.1% job_placement 0.2400 0.5394 9.2% === Interaction Effects: msg_latency_us === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ routing_mode job_placement -1.4000 36.6% routing_mode traffic_pattern -0.4225 11.0% vc_count link_bandwidth -0.3250 8.5% traffic_pattern job_placement 0.3250 8.5% routing_mode vc_count 0.2725 7.1% vc_count traffic_pattern 0.2625 6.9% link_bandwidth job_placement -0.2625 6.9% routing_mode link_bandwidth -0.2400 6.3% vc_count job_placement 0.1600 4.2% traffic_pattern link_bandwidth -0.1600 4.2% === Summary Statistics: msg_latency_us === routing_mode: Level N Mean Std Min Max ------------------------------------------------------------ adaptive 8 4.7700 2.2452 2.5300 9.5100 minimal 8 5.0325 2.2124 2.6900 9.1700 vc_count: Level N Mean Std Min Max ------------------------------------------------------------ 2 8 5.1125 2.7516 2.5300 9.5100 8 8 4.6900 1.5171 3.4100 7.1800 traffic_pattern: Level N Mean Std Min Max ------------------------------------------------------------ alltoall 8 4.7650 2.1716 2.6900 9.1700 nearest_neighbor 8 5.0375 2.2840 2.5300 9.5100 link_bandwidth: Level N Mean Std Min Max ------------------------------------------------------------ 100 8 4.2012 0.7444 3.3700 5.4200 200 8 5.6013 2.8811 2.5300 9.5100 job_placement: Level N Mean Std Min Max ------------------------------------------------------------ compact 8 4.7812 2.2149 2.6900 9.1700 scatter 8 5.0213 2.2442 2.5300 9.5100 === Main Effects: throughput_GBs === Factor Effect Std Error % Contribution -------------------------------------------------------------- link_bandwidth -98.4525 41.3864 38.5% routing_mode 97.8875 41.3864 38.3% job_placement 36.9425 41.3864 14.4% vc_count 21.0850 41.3864 8.2% traffic_pattern 1.4450 41.3864 0.6% === Interaction Effects: throughput_GBs === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ routing_mode job_placement 98.4525 20.1% vc_count traffic_pattern 97.8875 20.0% link_bandwidth job_placement -97.8875 20.0% vc_count job_placement -59.2600 12.1% traffic_pattern link_bandwidth 59.2600 12.1% routing_mode link_bandwidth -36.9425 7.5% routing_mode traffic_pattern 21.0850 4.3% vc_count link_bandwidth 8.5850 1.8% traffic_pattern job_placement -8.5850 1.8% routing_mode vc_count 1.4450 0.3% === Summary Statistics: throughput_GBs === routing_mode: Level N Mean Std Min Max ------------------------------------------------------------ adaptive 8 852.9575 147.9771 587.2800 1088.1600 minimal 8 950.8450 177.0681 606.3400 1104.5600 vc_count: Level N Mean Std Min Max ------------------------------------------------------------ 2 8 891.3587 205.2416 587.2800 1088.1600 8 8 912.4438 127.8577 745.4700 1104.5600 traffic_pattern: Level N Mean Std Min Max ------------------------------------------------------------ alltoall 8 901.1788 164.5003 606.3400 1073.5200 nearest_neighbor 8 902.6237 177.9444 587.2800 1104.5600 link_bandwidth: Level N Mean Std Min Max ------------------------------------------------------------ 100 8 951.1275 87.3484 845.6000 1073.5200 200 8 852.6750 213.4414 587.2800 1104.5600 job_placement: Level N Mean Std Min Max ------------------------------------------------------------ compact 8 883.4300 164.4862 606.3400 1104.5600 scatter 8 920.3725 175.7561 587.2800 1088.1600

Optimization Recommendations

doe optimize
=== Optimization: msg_latency_us === Direction: minimize Best observed run: #6 routing_mode = minimal vc_count = 2 traffic_pattern = alltoall link_bandwidth = 200 job_placement = compact Value: 2.53 Factor importance: 1. job_placement (effect: 1.6, contribution: 36.3%) 2. traffic_pattern (effect: 1.2, contribution: 26.1%) 3. link_bandwidth (effect: -0.8, contribution: 19.2%) 4. vc_count (effect: 0.7, contribution: 15.7%) 5. routing_mode (effect: 0.1, contribution: 2.7%) === Optimization: throughput_GBs === Direction: maximize Best observed run: #3 routing_mode = minimal vc_count = 8 traffic_pattern = nearest_neighbor link_bandwidth = 200 job_placement = compact Value: 1104.56 Factor importance: 1. job_placement (effect: -153.5, contribution: 46.2%) 2. routing_mode (effect: 97.3, contribution: 29.3%) 3. link_bandwidth (effect: 40.1, contribution: 12.1%) 4. traffic_pattern (effect: -30.6, contribution: 9.2%) 5. vc_count (effect: -10.6, contribution: 3.2%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.9584

Per-Response Desirability

ResponseWeightDesirabilityPredictedDir
msg_latency_us 1.0
0.9929
2.24 0.9929 2.24 us
throughput_GBs 1.5
0.9360
1094.03 0.9360 1094.03 GB/s

Recommended Settings

FactorValue
routing_modeminimal
vc_count2.377
traffic_patternalltoall
link_bandwidth100.3 Gbps
job_placementcompact

Source: from RSM model prediction

Trade-off Summary

Sacrifice = how much worse than single-objective best.

ResponsePredictedBest ObservedSacrifice
throughput_GBs1094.031104.56+10.53

Top 3 Runs by Desirability

RunDFactor Settings
#150.9091routing_mode=adaptive, vc_count=8, traffic_pattern=alltoall, link_bandwidth=200, job_placement=scatter
#30.9069routing_mode=minimal, vc_count=2, traffic_pattern=alltoall, link_bandwidth=200, job_placement=compact

Model Quality

ResponseType
throughput_GBs0.3482linear

Full Multi-Objective Output

doe optimize --multi
============================================================ MULTI-OBJECTIVE OPTIMIZATION Method: Derringer-Suich Desirability Function ============================================================ Overall desirability: D = 0.9584 Response Weight Desirability Predicted Direction --------------------------------------------------------------------- msg_latency_us 1.0 0.9929 2.24 us ↓ throughput_GBs 1.5 0.9360 1094.03 GB/s ↑ Recommended settings: routing_mode = minimal vc_count = 2.377 traffic_pattern = alltoall link_bandwidth = 100.3 Gbps job_placement = compact (from RSM model prediction) Trade-off summary: msg_latency_us: 2.24 (best observed: 2.53, sacrifice: -0.29) throughput_GBs: 1094.03 (best observed: 1104.56, sacrifice: +10.53) Model quality: msg_latency_us: R² = 0.4884 (linear) throughput_GBs: R² = 0.3482 (linear) Top 3 observed runs by overall desirability: 1. Run #6 (D=0.9371): routing_mode=adaptive, vc_count=2, traffic_pattern=nearest_neighbor, link_bandwidth=100, job_placement=compact 2. Run #15 (D=0.9091): routing_mode=adaptive, vc_count=8, traffic_pattern=alltoall, link_bandwidth=200, job_placement=scatter 3. Run #3 (D=0.9069): routing_mode=minimal, vc_count=2, traffic_pattern=alltoall, link_bandwidth=200, job_placement=compact
← CPU Cross-NUMA Bandwidth Next: Hardware & Software Prefetch Tuning →