Full Factorial

GPU Compute-Communication Overlap

Maximize GPU computation and inter-node communication overlap in distributed stencil codes.

Summary

This experiment investigates GPU compute-communication overlap. It uses a full factorial design to maximize the overlap of GPU computation with inter-node communication in distributed stencil codes.

The design varies 4 factors: num_streams (1 to 4), gdrdma (off or on), chunk_count (1 to 8), and kernel_fusion (off or on). The goal is to optimize 2 responses: overlap_efficiency (%, maximize) and step_time_ms (ms, minimize). Fixed conditions held constant across all runs are gpus = 64, gpu_model = H100_SXM, interconnect = NDR_InfiniBand, and problem_size = 2048^3.

A full factorial design was used to explore all 16 possible combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (16 runs).
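The enumeration described above can be sketched in a few lines of Python. This is an illustration only, not how the `doe` tool generates its matrix: the `factors` dictionary and the `full_factorial` helper are hypothetical names, and the level values are taken from the Factors table.

```python
from itertools import product

# Two levels per factor, as listed in the Factors table.
factors = {
    "num_streams": [1, 4],
    "gdrdma": ["off", "on"],
    "chunk_count": [1, 8],
    "kernel_fusion": ["off", "on"],
}

def full_factorial(levels):
    """Return every combination of factor levels as a list of dicts."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]

runs = full_factorial(factors)
print(len(runs))  # 2 * 2 * 2 * 2 combinations
for run in runs[:3]:
    print(run)
```

The run order produced here is the plain Cartesian order. The matrix in this report is randomized (the generate command takes a seed), so the row order will differ.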

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM surface plots below visualize how pairs of factors jointly affect each response.

Key Findings

For overlap_efficiency, the most influential factors in the doe analyze output were kernel_fusion (56.6%), gdrdma (22.0%), and num_streams (12.4%). The best observed value was 71.12% (at num_streams = 1, gdrdma = off, chunk_count = 8, kernel_fusion = off).

For step_time_ms, the most influential factors in the doe analyze output were kernel_fusion (55.8%), gdrdma (20.2%), and num_streams (13.0%). The best observed value was 47.11 ms (at num_streams = 1, gdrdma = off, chunk_count = 8, kernel_fusion = off).

Recommended Next Steps

Experimental Setup

Factors

Factor          Levels     Type          Unit
---------------------------------------------
num_streams     1 – 4      continuous
gdrdma          off / on   categorical
chunk_count     1 – 8      continuous
kernel_fusion   off / on   categorical

Fixed: gpus = 64, gpu_model = H100 SXM, interconnect = NDR InfiniBand, problem_size = 2048^3

Responses

Response             Direction    Unit
--------------------------------------
overlap_efficiency   ↑ maximize   %
step_time_ms         ↓ minimize   ms

Experimental Matrix

The Full Factorial Design produces 16 runs. Each row is one experiment with specific factor settings.

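Run   num_streams   gdrdma   chunk_count   kernel_fusion
--------------------------------------------------------
1     1             on       8             on
2     4             off      1             on
3     1             on       1             on
4     1             on       8             off
5     4             on       8             off
6     4             off      8             off
7     4             on       1             off
8     4             off      1             off
9     1             off      1             on
10    1             off      8             off
11    4             on       1             on
12    4             on       8             on
13    1             on       1             off
14    4             off      8             on
15    1             off      1             off
16    1             off      8             on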

How to Run

terminal
$ doe info --config use_cases/24_gpu_comm_overlap/config.json
$ doe generate --config use_cases/24_gpu_comm_overlap/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/24_gpu_comm_overlap/config.json
$ doe optimize --config use_cases/24_gpu_comm_overlap/config.json
$ doe optimize --config use_cases/24_gpu_comm_overlap/config.json --multi # multi-objective
$ doe report --config use_cases/24_gpu_comm_overlap/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: overlap_efficiency

Pareto Chart

Pareto chart for overlap_efficiency

Main Effects Plot

Main effects plot for overlap_efficiency

Response: step_time_ms

Pareto Chart

Pareto chart for step_time_ms

Main Effects Plot

Main effects plot for step_time_ms

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

overlap_efficiency: num_streams vs chunk_count

RSM surface: overlap_efficiency: num_streams vs chunk_count

step_time_ms: num_streams vs chunk_count

RSM surface: step_time_ms: num_streams vs chunk_count

Full Analysis Output

doe analyze
=== Main Effects: overlap_efficiency ===
Factor          Effect      Std Error   % Contribution
--------------------------------------------------------------
kernel_fusion   -12.7575    3.0791      56.6%
gdrdma           -4.9700    3.0791      22.0%
num_streams      -2.7900    3.0791      12.4%
chunk_count       2.0400    3.0791       9.0%

=== Interaction Effects: overlap_efficiency ===
Factor A        Factor B        Interaction   % Contribution
------------------------------------------------------------------------
gdrdma          chunk_count     -10.5825      33.9%
chunk_count     kernel_fusion    -9.8300      31.5%
gdrdma          kernel_fusion    -3.8400      12.3%
num_streams     chunk_count      -3.6825      11.8%
num_streams     gdrdma           -2.2225       7.1%
num_streams     kernel_fusion    -1.0400       3.3%

=== Summary Statistics: overlap_efficiency ===
num_streams:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
1       8   48.1863   13.4865   32.1300   71.1200
4       8   45.3963   11.7783   24.5600   61.8500

gdrdma:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
off     8   49.2763   12.9130   35.7000   71.1200
on      8   44.3062   12.0084   24.5600   62.0000

chunk_count:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
1       8   45.7713    8.3288   35.7000   62.0000
8       8   47.8113   15.9158   24.5600   71.1200

kernel_fusion:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
off     8   53.1700   11.5749   35.7000   71.1200
on      8   40.4125    9.9036   24.5600   53.5800

=== Main Effects: step_time_ms ===
Factor          Effect      Std Error   % Contribution
--------------------------------------------------------------
kernel_fusion    10.7475    2.5064      55.8%
gdrdma            3.8800    2.5064      20.2%
num_streams       2.4975    2.5064      13.0%
chunk_count      -2.1250    2.5064      11.0%

=== Interaction Effects: step_time_ms ===
Factor A        Factor B        Interaction   % Contribution
------------------------------------------------------------------------
gdrdma          chunk_count       8.2026      32.6%
chunk_count     kernel_fusion     7.9900      31.8%
num_streams     chunk_count       3.7750      15.0%
gdrdma          kernel_fusion     2.6250      10.4%
num_streams     kernel_fusion     1.5425       6.1%
num_streams     gdrdma            1.0100       4.0%

=== Summary Statistics: step_time_ms ===
num_streams:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
1       8   65.9675   10.6844   47.1100   78.0900
4       8   68.4650    9.8823   54.5400   85.2000

gdrdma:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
off     8   65.2763   10.9366   47.1100   78.1600
on      8   69.1562    9.3364   56.5000   85.2000

chunk_count:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
1       8   68.2788    6.1498   56.5000   75.6200
8       8   66.1538   13.2280   47.1100   85.2000

kernel_fusion:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
off     8   61.8425    9.1360   47.1100   75.6200
on      8   72.5900    8.1185   60.6400   85.2000
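The main-effect figures in this output can be reproduced from the level means in the Summary Statistics tables. The sketch below assumes that an effect is the high-level mean minus the low-level mean, and that % Contribution is the absolute effect divided by the sum of absolute effects. Both assumptions are inferred from the reported numbers for overlap_efficiency, not taken from the tool's source, and the function names are hypothetical.

```python
# Level means for overlap_efficiency, copied from the Summary Statistics output.
# Each tuple is (mean at low level, mean at high level).
level_means = {
    "num_streams":   (48.1863, 45.3963),   # 1 -> 4
    "gdrdma":        (49.2763, 44.3062),   # off -> on
    "chunk_count":   (45.7713, 47.8113),   # 1 -> 8
    "kernel_fusion": (53.1700, 40.4125),   # off -> on
}

def main_effects(means):
    """Main effect = mean response at the high level minus mean at the low level."""
    return {name: high - low for name, (low, high) in means.items()}

def percent_contribution(effects):
    """Assumed normalization: |effect| as a share of the sum of |effects|."""
    total = sum(abs(e) for e in effects.values())
    return {name: 100.0 * abs(e) / total for name, e in effects.items()}

effects = main_effects(level_means)
shares = percent_contribution(effects)

# Print factors from largest to smallest absolute effect.
for name in sorted(effects, key=lambda n: -abs(effects[n])):
    print(f"{name:<14} {effects[name]:>9.4f} {shares[name]:>6.1f}%")
```

Under these assumptions the computed effects and shares agree with the Main Effects table for overlap_efficiency to within rounding of the printed means.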

Optimization Recommendations

doe optimize
=== Optimization: overlap_efficiency ===
Direction: maximize
Best observed run: #12
  num_streams = 1
  gdrdma = off
  chunk_count = 8
  kernel_fusion = off
Value: 71.12

Factor importance:
  1. gdrdma (effect: -4.2, contribution: 29.9%)
  2. chunk_count (effect: 4.0, contribution: 28.5%)
  3. num_streams (effect: -3.8, contribution: 26.8%)
  4. kernel_fusion (effect: 2.1, contribution: 14.8%)

=== Optimization: step_time_ms ===
Direction: minimize
Best observed run: #12
  num_streams = 1
  gdrdma = off
  chunk_count = 8
  kernel_fusion = off
Value: 47.11

Factor importance:
  1. num_streams (effect: 4.0, contribution: 34.2%)
  2. gdrdma (effect: 3.3, contribution: 28.6%)
  3. chunk_count (effect: -2.5, contribution: 21.4%)
  4. kernel_fusion (effect: -1.8, contribution: 15.8%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
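As a rough illustration of the scaling and combination steps, the sketch below implements the standard one-sided Derringer–Suich desirability functions and the weighted geometric mean. The bounds passed to `desirability_max` and `desirability_min` are hypothetical parameters; the bounds and shape exponents the `doe` tool actually uses are not stated in this report.

```python
import math

def desirability_max(y, low, target, s=1.0):
    """Larger-is-better: 0 at or below `low`, 1 at or above `target`."""
    if y <= low:
        return 0.0
    if y >= target:
        return 1.0
    return ((y - low) / (target - low)) ** s

def desirability_min(y, target, high, s=1.0):
    """Smaller-is-better: 1 at or below `target`, 0 at or above `high`."""
    if y <= target:
        return 1.0
    if y >= high:
        return 0.0
    return ((high - y) / (high - target)) ** s

def overall_desirability(ds, weights):
    """Weighted geometric mean: D = (prod d_i ** w_i) ** (1 / sum w_i)."""
    if any(d <= 0.0 for d in ds):
        return 0.0  # any fully undesirable response zeroes the compromise
    log_sum = sum(w * math.log(d) for d, w in zip(ds, weights))
    return math.exp(log_sum / sum(weights))

# Per-response desirabilities and weights as reported in this section.
D = overall_desirability([0.9545, 0.9545], [1.5, 1.0])
print(round(D, 4))
```

Because both responses report the same desirability, the weighted geometric mean equals that value regardless of the weights; the weights only matter when the individual desirabilities differ.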

Overall Desirability
D = 0.9545

Per-Response Desirability

Response             Weight   Desirability   Predicted   Dir
------------------------------------------------------------
overlap_efficiency   1.5      0.9545         71.12 %     ↑
step_time_ms         1.0      0.9545         47.11 ms    ↓

Recommended Settings

Factor          Value
---------------------
num_streams     4
gdrdma          off
chunk_count     8
kernel_fusion   on

Source: from observed run #12

Trade-off Summary

Sacrifice = how much worse than single-objective best.

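Response             Predicted   Best Observed   Sacrifice
----------------------------------------------------------
overlap_efficiency   71.12       71.12           +0.00
step_time_ms         47.11       47.11           +0.00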

Top 3 Runs by Desirability

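Run   D        Factor Settings
------------------------------------------------------------------------------
#12   0.9545   num_streams=4, gdrdma=off, chunk_count=8, kernel_fusion=on
#5    0.7750   num_streams=4, gdrdma=on, chunk_count=8, kernel_fusion=off
#11   0.7577   num_streams=1, gdrdma=off, chunk_count=1, kernel_fusion=off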

Model Quality

Response             R²       Type
------------------------------------
overlap_efficiency   0.1046   linear
step_time_ms         0.1129   linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9545

Response             Weight   Desirability   Predicted   Direction
---------------------------------------------------------------------
overlap_efficiency   1.5      0.9545         71.12 %     ↑
step_time_ms         1.0      0.9545         47.11 ms    ↓

Recommended settings:
  num_streams = 4
  gdrdma = off
  chunk_count = 8
  kernel_fusion = on
  (from observed run #12)

Trade-off summary:
  overlap_efficiency: 71.12 (best observed: 71.12, sacrifice: +0.00)
  step_time_ms: 47.11 (best observed: 47.11, sacrifice: +0.00)

Model quality:
  overlap_efficiency: R² = 0.1046 (linear)
  step_time_ms: R² = 0.1129 (linear)

Top 3 observed runs by overall desirability:
  1. Run #12 (D=0.9545): num_streams=4, gdrdma=off, chunk_count=8, kernel_fusion=on
  2. Run #5 (D=0.7750): num_streams=4, gdrdma=on, chunk_count=8, kernel_fusion=off
  3. Run #11 (D=0.7577): num_streams=1, gdrdma=off, chunk_count=1, kernel_fusion=off