GPU Compute-Communication Overlap

Summary

This experiment investigates GPU compute-communication overlap. It uses a full factorial design to maximize the overlap between GPU computation and inter-node communication in distributed stencil codes.

The design varies four factors: num_streams (1 to 4), gdrdma (off or on), chunk_count (1 to 8), and kernel_fusion (off or on). The goal is to optimize two responses: overlap_efficiency (%, maximize) and step_time_ms (ms, minimize). Conditions held constant across all runs are gpus = 64, gpu_model = H100 SXM, interconnect = NDR InfiniBand, and problem_size = 2048^3.

A full factorial design was used to explore all 16 possible combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (16 runs).
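As a minimal sketch of what "all 16 combinations" means, the following enumerates the Cartesian product of the two levels of each factor. The factor names and levels are taken from this report; the enumeration is in standard order, not the randomized run order that the `doe generate` step produces, and it is not the tool's own implementation.

```python
from itertools import product

# Two levels per factor, as listed in this report's factor table.
factors = {
    "num_streams": [1, 4],
    "gdrdma": ["off", "on"],
    "chunk_count": [1, 8],
    "kernel_fusion": ["off", "on"],
}

# Cartesian product of all levels: 2 * 2 * 2 * 2 = 16 runs.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for i, run in enumerate(runs, start=1):
    print(i, run)
```

Every factor level appears in exactly 8 of the 16 runs, which is why each main effect can be estimated as a difference of two 8-run means.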

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The response surface plots below visualize how pairs of factors jointly affect each response.

Key Findings

For overlap_efficiency, the most influential factors were kernel_fusion (56.6%), gdrdma (22.0%), and num_streams (12.4%). The best observed value was 71.12% (at num_streams = 1, gdrdma = off, chunk_count = 8, kernel_fusion = off).

For step_time_ms, the most influential factors were kernel_fusion (55.8%), gdrdma (20.2%), and num_streams (13.0%). The best observed value was 47.11 ms (at num_streams = 1, gdrdma = off, chunk_count = 8, kernel_fusion = off).

Recommended Next Steps

Consider whether any of the fixed conditions (gpus, gpu_model, interconnect, problem_size) should be varied in a future study.

Experimental Setup

Factors

Factor	Levels	Type	Unit
num_streams	1 – 4	continuous	—
gdrdma	off / on	categorical	—
chunk_count	1 – 8	continuous	—
kernel_fusion	off / on	categorical	—

Fixed: gpus = 64, gpu_model = H100 SXM, interconnect = NDR InfiniBand, problem_size = 2048^3

Responses

Response	Direction	Unit
overlap_efficiency	↑ maximize	%
step_time_ms	↓ minimize	ms

Experimental Matrix

The full factorial design produces 16 runs. Each row is one run with specific factor settings.

Run	`num_streams`	`gdrdma`	`chunk_count`	`kernel_fusion`
1	1	on	8	on
2	4	off	1	on
3	1	on	1	on
4	1	on	8	off
5	4	on	8	off
6	4	off	8	off
7	4	on	1	off
8	4	off	1	off
9	1	off	1	on
10	1	off	8	off
11	4	on	1	on
12	4	on	8	on
13	1	on	1	off
14	4	off	8	on
15	1	off	1	off
16	1	off	8	on

How to Run

terminal
$ doe info --config use_cases/24_gpu_comm_overlap/config.json
$ doe generate --config use_cases/24_gpu_comm_overlap/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/24_gpu_comm_overlap/config.json
$ doe optimize --config use_cases/24_gpu_comm_overlap/config.json
$ doe optimize --config use_cases/24_gpu_comm_overlap/config.json --multi  # multi-objective
$ doe report --config use_cases/24_gpu_comm_overlap/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: overlap_efficiency

Pareto Chart

Main Effects Plot

Response: step_time_ms

Pareto Chart

Main Effects Plot

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

overlap_efficiency: num_streams vs chunk_count

step_time_ms: num_streams vs chunk_count

Full Analysis Output

doe analyze
=== Main Effects: overlap_efficiency ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
kernel_fusion          -12.7575       3.0791            56.6%
gdrdma                  -4.9700       3.0791            22.0%
num_streams             -2.7900       3.0791            12.4%
chunk_count              2.0400       3.0791             9.0%

=== Interaction Effects: overlap_efficiency ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
gdrdma               chunk_count              -10.5825            33.9%
chunk_count          kernel_fusion             -9.8300            31.5%
gdrdma               kernel_fusion             -3.8400            12.3%
num_streams          chunk_count               -3.6825            11.8%
num_streams          gdrdma                    -2.2225             7.1%
num_streams          kernel_fusion             -1.0400             3.3%

=== Summary Statistics: overlap_efficiency ===

num_streams:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1                   8    48.1863    13.4865    32.1300    71.1200
  4                   8    45.3963    11.7783    24.5600    61.8500

gdrdma:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    49.2763    12.9130    35.7000    71.1200
  on                  8    44.3062    12.0084    24.5600    62.0000

chunk_count:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1                   8    45.7713     8.3288    35.7000    62.0000
  8                   8    47.8113    15.9158    24.5600    71.1200

kernel_fusion:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    53.1700    11.5749    35.7000    71.1200
  on                  8    40.4125     9.9036    24.5600    53.5800

=== Main Effects: step_time_ms ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
kernel_fusion           10.7475       2.5064            55.8%
gdrdma                   3.8800       2.5064            20.2%
num_streams              2.4975       2.5064            13.0%
chunk_count             -2.1250       2.5064            11.0%

=== Interaction Effects: step_time_ms ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
gdrdma               chunk_count                8.2026            32.6%
chunk_count          kernel_fusion              7.9900            31.8%
num_streams          chunk_count                3.7750            15.0%
gdrdma               kernel_fusion              2.6250            10.4%
num_streams          kernel_fusion              1.5425             6.1%
num_streams          gdrdma                     1.0100             4.0%

=== Summary Statistics: step_time_ms ===

num_streams:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1                   8    65.9675    10.6844    47.1100    78.0900
  4                   8    68.4650     9.8823    54.5400    85.2000

gdrdma:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    65.2763    10.9366    47.1100    78.1600
  on                  8    69.1562     9.3364    56.5000    85.2000

chunk_count:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1                   8    68.2788     6.1498    56.5000    75.6200
  8                   8    66.1538    13.2280    47.1100    85.2000

kernel_fusion:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    61.8425     9.1360    47.1100    75.6200
  on                  8    72.5900     8.1185    60.6400    85.2000
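The main effects and contribution percentages for overlap_efficiency can be reproduced from the level means in the summary statistics above. The sketch below assumes that an effect is the high-level mean minus the low-level mean, and that "% Contribution" is the absolute effect divided by the sum of absolute main effects. That normalization is inferred from the reported numbers, not taken from the tool's documentation.

```python
# Level means copied from the overlap_efficiency summary statistics above.
level_means = {
    "kernel_fusion": {"low": 53.1700, "high": 40.4125},  # off -> on
    "gdrdma":        {"low": 49.2763, "high": 44.3062},  # off -> on
    "num_streams":   {"low": 48.1863, "high": 45.3963},  # 1 -> 4
    "chunk_count":   {"low": 45.7713, "high": 47.8113},  # 1 -> 8
}

# Main effect: mean response at the high level minus mean at the low level.
effects = {name: m["high"] - m["low"] for name, m in level_means.items()}

# Assumed normalization: share of the summed absolute main effects.
total = sum(abs(e) for e in effects.values())
contributions = {name: 100.0 * abs(e) / total for name, e in effects.items()}

for name in effects:
    print(f"{name:<15} {effects[name]:>9.4f} {contributions[name]:>6.1f}%")
```

Under these assumptions the computed values agree with the Main Effects table (for example, kernel_fusion has an effect of about -12.7575 and a contribution of about 56.6%). The same calculation applied to the step_time_ms level means yields that response's table.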

Optimization Recommendations

doe optimize
=== Optimization: overlap_efficiency ===
Direction: maximize

Best observed run: #12
  num_streams = 1
  gdrdma = off
  chunk_count = 8
  kernel_fusion = off
  Value: 71.12

Factor importance:
  1. gdrdma  (effect: -4.2, contribution: 29.9%)
  2. chunk_count  (effect: 4.0, contribution: 28.5%)
  3. num_streams  (effect: -3.8, contribution: 26.8%)
  4. kernel_fusion  (effect: 2.1, contribution: 14.8%)

=== Optimization: step_time_ms ===
Direction: minimize

Best observed run: #12
  num_streams = 1
  gdrdma = off
  chunk_count = 8
  kernel_fusion = off
  Value: 47.11

Factor importance:
  1. num_streams  (effect: 4.0, contribution: 34.2%)
  2. gdrdma  (effect: 3.3, contribution: 28.6%)
  3. chunk_count  (effect: -2.5, contribution: 21.4%)
  4. kernel_fusion  (effect: -1.8, contribution: 15.8%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
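The scaling and combination steps can be sketched as follows. This is an illustrative version using the simplest linear desirability shape; the function names and the `low`/`high` bounds are hypothetical, and the bounds the `doe` tool actually uses are not stated in this report.

```python
import math

def desirability_max(y, low, high):
    """Larger-is-better: 0 at or below `low`, 1 at or above `high`, linear in between."""
    return min(1.0, max(0.0, (y - low) / (high - low)))

def desirability_min(y, low, high):
    """Smaller-is-better: 1 at or below `low`, 0 at or above `high`, linear in between."""
    return min(1.0, max(0.0, (high - y) / (high - low)))

def overall_desirability(d_values, weights):
    """Weighted geometric mean of the individual desirabilities."""
    if any(d == 0.0 for d in d_values):
        return 0.0  # one fully unacceptable response vetoes the whole setting
    log_sum = sum(w * math.log(d) for d, w in zip(d_values, weights))
    return math.exp(log_sum / sum(weights))

# Per-response desirabilities and weights as reported in this section.
D = overall_desirability([0.9545, 0.9545], [1.5, 1.0])
print(round(D, 4))
```

Because both responses have the same individual desirability here, the weights do not change the combined value. They matter only when the per-response desirabilities differ.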

Overall Desirability

D = 0.9545

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`overlap_efficiency`	1.5	0.9545	71.12 %	↑
`step_time_ms`	1.0	0.9545	47.11 ms	↓

Recommended Settings

Factor	Value
`num_streams`	4
`gdrdma`	off
`chunk_count`	8
`kernel_fusion`	on

Source: from observed run #12

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response	Predicted	Best Observed	Sacrifice
`overlap_efficiency`	71.12	71.12	+0.00
`step_time_ms`	47.11	47.11	+0.00

Top 3 Runs by Desirability

Run	D	Factor Settings
#12	0.9545	num_streams=4, gdrdma=off, chunk_count=8, kernel_fusion=on
#5	0.7750	num_streams=4, gdrdma=on, chunk_count=8, kernel_fusion=off
#11	0.7577	num_streams=1, gdrdma=off, chunk_count=1, kernel_fusion=off

Model Quality

Response	R²	Type
`overlap_efficiency`	0.1046	linear
`step_time_ms`	0.1129	linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9545

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
overlap_efficiency           1.5       0.9545       71.12 %   ↑
step_time_ms                 1.0       0.9545       47.11 ms   ↓

Recommended settings:
  num_streams = 4
  gdrdma = off
  chunk_count = 8
  kernel_fusion = on
  (from observed run #12)

Trade-off summary:
  overlap_efficiency: 71.12 (best observed: 71.12, sacrifice: +0.00)
  step_time_ms: 47.11 (best observed: 47.11, sacrifice: +0.00)

Model quality:
  overlap_efficiency: R² = 0.1046 (linear)
  step_time_ms: R² = 0.1129 (linear)

Top 3 observed runs by overall desirability:
  1. Run #12 (D=0.9545): num_streams=4, gdrdma=off, chunk_count=8, kernel_fusion=on
  2. Run #5 (D=0.7750): num_streams=4, gdrdma=on, chunk_count=8, kernel_fusion=off
  3. Run #11 (D=0.7577): num_streams=1, gdrdma=off, chunk_count=1, kernel_fusion=off
