Full Factorial

GPU Kernel Optimization

Optimize CUDA kernel launch parameters for maximum throughput on NVIDIA GPUs.

Summary

This experiment investigates GPU kernel optimization: a full factorial design to tune CUDA kernel launch parameters for maximum throughput and occupancy.

The design varies 4 factors: block size (128 to 512 threads), shared memory (16 to 48 KB), unroll factor (2 to 8), and precision (fp32 or fp64, categorical). The goal is to maximize 2 responses: gflops (GFLOPS) and occupancy (%). Fixed conditions held constant across all runs: gpu model = A100, problem size = 8192.

A full factorial design was used to explore all 2⁴ = 16 combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (16 runs).
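Enumerating a two-level full factorial is just the Cartesian product of the factor levels. A minimal sketch (the actual run order is produced by `doe generate` from the config, and may be randomized):

```python
from itertools import product

# Two levels per factor, as listed in the design above.
levels = {
    "block_size": [128, 512],       # threads
    "shared_mem": [16, 48],         # KB
    "unroll_factor": [2, 8],
    "precision": ["fp32", "fp64"],  # categorical
}

# Full factorial: every combination of factor levels appears exactly once.
runs = [dict(zip(levels, combo)) for combo in product(*levels.values())]
print(len(runs))  # 2**4 = 16 runs
```

The product grows multiplicatively with each added factor, which is why full factorials are usually reserved for small factor counts like the 4 used here.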

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
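A quadratic RSM is an ordinary least-squares fit of the response against linear, interaction, and squared terms in the coded factors. The sketch below uses a toy 3-level grid with made-up response values (pure quadratic terms cannot be estimated from two levels alone), not the actual run data:

```python
import numpy as np

# Toy data: two coded factors x1, x2 in [-1, 1] on a 3x3 grid;
# y is an illustrative measured response (e.g. GFLOPS).
x1, x2 = np.meshgrid([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
x1, x2 = x1.ravel(), x2.ravel()
y = np.array([420, 500, 590, 450, 560, 540, 440, 520, 510], dtype=float)

# Quadratic RSM: y ~ b0 + b1*x1 + b2*x2 + b12*x1*x2 + b11*x1^2 + b22*x2^2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the fitted surface on a fine grid (other factors held at center)
# -- this is what the contour/surface plots below visualize.
g1, g2 = np.meshgrid(np.linspace(-1, 1, 25), np.linspace(-1, 1, 25))
surface = (coef[0] + coef[1] * g1 + coef[2] * g2
           + coef[3] * g1 * g2 + coef[4] * g1**2 + coef[5] * g2**2)
```

The sign of the squared coefficients tells you whether the surface is domed (interior maximum) or bowl-shaped, and the interaction coefficient produces the twisted "saddle" shapes discussed later.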

Key Findings

For gflops, the most influential factors were block size (41.5%), precision (28.0%), and shared mem (27.2%). The best observed value was 705.9 GFLOPS (at block size = 128, shared mem = 16, unroll factor = 2).

For occupancy, the most influential factors were shared mem (39.4%), block size (25.9%), and precision (22.7%). The best observed value was 77.0% (at block size = 128, shared mem = 48, unroll factor = 2).

Recommended Next Steps

Experimental Setup

Factors

Factor          Levels       Type          Unit
block_size      128, 512     continuous    threads
shared_mem      16, 48       continuous    KB
unroll_factor   2, 8         continuous
precision       fp32, fp64   categorical

Fixed: gpu_model=A100, problem_size=8192

Responses

Response    Direction    Unit
gflops      ↑ maximize   GFLOPS
occupancy   ↑ maximize   %

Experimental Matrix

The Full Factorial Design produces 16 runs. Each row is one experiment with specific factor settings.

Run   block_size   shared_mem   unroll_factor   precision
1     128          48           8               fp64
2     512          16           2               fp64
3     128          48           2               fp64
4     128          48           8               fp32
5     512          48           8               fp32
6     512          16           8               fp32
7     512          48           2               fp32
8     512          16           2               fp32
9     128          16           2               fp64
10    128          16           8               fp32
11    512          48           2               fp64
12    512          48           8               fp64
13    128          48           2               fp32
14    512          16           8               fp64
15    128          16           2               fp32
16    128          16           8               fp64

How to Run

terminal
$ doe info --config use_cases/08_gpu_kernel_optimization/config.json
$ doe generate --config use_cases/08_gpu_kernel_optimization/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/08_gpu_kernel_optimization/config.json
$ doe optimize --config use_cases/08_gpu_kernel_optimization/config.json
$ doe report --config use_cases/08_gpu_kernel_optimization/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: gflops

Pareto Chart

Pareto chart for gflops

Main Effects Plot

Main effects plot for gflops

Response: occupancy

Pareto Chart

Pareto chart for occupancy

Main Effects Plot

Main effects plot for occupancy

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.


How to Read These Surfaces

Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.

  • Flat surface — these two factors have little effect on the response.
  • Tilted plane — strong linear effect; moving along one axis consistently changes the response.
  • Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
  • Saddle shape — significant interaction; the best setting of one factor depends on the other.
  • Red dots far from surface — poor model fit in that region; be cautious about predictions there.

gflops (GFLOPS) — R² = 0.941, Adj R² = 0.118
The raw R² is high, but the much lower adjusted R² suggests the quadratic model uses nearly as many terms as there are runs; treat the surface shape as indicative rather than reliable.
Curvature detected in block_size, shared_mem — look for a peak or valley in the surface.
Strongest linear driver: unroll_factor (higher unroll decreases gflops, consistent with the main-effects table below).
Notable interaction: block_size × precision — the effect of one depends on the level of the other. Look for a twisted surface.

occupancy (%) — R² = 0.775, Adj R² = -2.381
The negative adjusted R² indicates overfitting: the surface shows broad trends, but its quantitative predictions should not be trusted.
Curvature detected in unroll_factor, block_size — look for a peak or valley in the surface.
Strongest linear driver: shared_mem (decreases occupancy).
Notable interaction: block_size × shared_mem — the effect of one depends on the level of the other. Look for a twisted surface.

gflops: block size vs shared mem

RSM surface: gflops — block size vs shared mem

gflops: block size vs unroll factor

RSM surface: gflops — block size vs unroll factor

gflops: shared mem vs unroll factor

RSM surface: gflops — shared mem vs unroll factor

occupancy: block size vs shared mem

RSM surface: occupancy — block size vs shared mem

occupancy: block size vs unroll factor

RSM surface: occupancy — block size vs unroll factor

occupancy: shared mem vs unroll factor

RSM surface: occupancy — shared mem vs unroll factor

Full Analysis Output

doe analyze
=== Main Effects: gflops ===
Factor              Effect     Std Error   % Contribution
--------------------------------------------------------------
unroll_factor    -117.3000       38.1811            71.4%
shared_mem        -19.9000       38.1811            12.1%
precision         -16.7000       38.1811            10.2%
block_size         10.3250       38.1811             6.3%

=== Interaction Effects: gflops ===
Factor A        Factor B        Interaction   % Contribution
------------------------------------------------------------------------
block_size      precision         -134.9000            36.4%
shared_mem      unroll_factor       98.8750            26.7%
block_size      shared_mem         -80.6500            21.8%
block_size      unroll_factor       27.4000             7.4%
unroll_factor   precision          -25.0750             6.8%
shared_mem      precision           -3.5750             1.0%

=== Summary Statistics: gflops ===
block_size:
Level       N       Mean        Std        Min        Max
------------------------------------------------------------
128         8   433.5250   178.1231   196.0000   705.9000
512         8   443.8500   134.8807   259.0000   593.5000
shared_mem:
Level       N       Mean        Std        Min        Max
------------------------------------------------------------
16          8   448.6375   179.4997   196.0000   705.9000
48          8   428.7375   132.4202   252.2000   620.9000
unroll_factor:
Level       N       Mean        Std        Min        Max
------------------------------------------------------------
2           8   497.3375   145.8686   259.0000   705.9000
8           8   380.0375   144.3657   196.0000   620.9000
precision:
Level       N       Mean        Std        Min        Max
------------------------------------------------------------
fp32        8   447.0375   115.6024   252.2000   592.4000
fp64        8   430.3375   190.9405   196.0000   705.9000

=== Main Effects: occupancy ===
Factor              Effect   Std Error   % Contribution
--------------------------------------------------------------
shared_mem         -8.6250      3.9148            43.8%
precision          -5.5750      3.9148            28.3%
block_size         -2.7500      3.9148            14.0%
unroll_factor      -2.7250      3.9148            13.9%

=== Interaction Effects: occupancy ===
Factor A        Factor B        Interaction   % Contribution
------------------------------------------------------------------------
shared_mem      precision          -10.1500            32.7%
shared_mem      unroll_factor        8.2500            26.6%
unroll_factor   precision           -6.1500            19.8%
block_size      unroll_factor       -4.0250            13.0%
block_size      precision           -2.1250             6.8%
block_size      shared_mem           0.3250             1.0%

=== Summary Statistics: occupancy ===
block_size:
Level       N      Mean       Std       Min       Max
------------------------------------------------------------
128         8   52.4750   13.5118   28.5000   73.2000
512         8   49.7250   18.3997   28.4000   77.0000
shared_mem:
Level       N      Mean       Std       Min       Max
------------------------------------------------------------
16          8   55.4125   13.9524   35.0000   77.0000
48          8   46.7875   16.9783   28.4000   72.0000
unroll_factor:
Level       N      Mean       Std       Min       Max
------------------------------------------------------------
2           8   52.4625   20.4763   28.4000   77.0000
8           8   49.7375   10.0955   29.1000   65.3000
precision:
Level       N      Mean       Std       Min       Max
------------------------------------------------------------
fp32        8   53.8875   16.2604   28.5000   73.2000
fp64        8   48.3125   15.5974   28.4000   77.0000
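The main-effect numbers above can be checked by hand: each effect is the mean response at the factor's high level minus the mean at its low level, and the % contribution column is consistent with normalizing the absolute effects (a sketch inferred from the printed numbers, not necessarily the tool's exact formula):

```python
# Effect of a two-level factor = mean(high-level responses) - mean(low-level responses).
# Level means taken from the gflops summary statistics above.
effect_unroll = 380.0375 - 497.3375   # unroll_factor 8 vs 2 -> -117.3

# % contribution appears to be |effect| normalized by the sum of |effects|.
effects = {
    "unroll_factor": -117.300,
    "shared_mem":     -19.900,
    "precision":      -16.700,
    "block_size":      10.325,
}
total = sum(abs(e) for e in effects.values())
contribution = {name: 100 * abs(e) / total for name, e in effects.items()}
# contribution["unroll_factor"] is ~71.4, matching the table
```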

Optimization Recommendations

doe optimize
=== Optimization: gflops ===
Direction: maximize

Best observed run: #6
  block_size = 512
  shared_mem = 16
  unroll_factor = 2
  precision = fp32
  Value: 705.9

RSM Model (linear, R² = 0.30):
  Coefficients:
    intercept: +438.6875
    block_size: +0.1250
    shared_mem: -69.4875
    unroll_factor: -13.3125
    precision: -40.0625

Predicted optimum:
  block_size = 512
  shared_mem = 16
  unroll_factor = 2
  precision = fp32
  Predicted value: 561.6750

Factor importance:
  1. shared_mem (effect: -139.0, contribution: 56.5%)
  2. precision (effect: -80.1, contribution: 32.6%)
  3. unroll_factor (effect: -26.6, contribution: 10.8%)
  4. block_size (effect: 0.2, contribution: 0.1%)

=== Optimization: occupancy ===
Direction: maximize

Best observed run: #10
  block_size = 128
  shared_mem = 16
  unroll_factor = 2
  precision = fp32
  Value: 77.0

RSM Model (linear, R² = 0.48):
  Coefficients:
    intercept: +51.1000
    block_size: -4.9875
    shared_mem: -6.6000
    unroll_factor: -4.4750
    precision: -4.7250

Predicted optimum:
  block_size = 128
  shared_mem = 16
  unroll_factor = 2
  precision = fp32
  Predicted value: 71.8875

Factor importance:
  1. shared_mem (effect: -13.2, contribution: 31.7%)
  2. block_size (effect: -10.0, contribution: 24.0%)
  3. precision (effect: -9.5, contribution: 22.7%)
  4. unroll_factor (effect: -9.0, contribution: 21.5%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
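The desirability math is small enough to sketch directly. For a maximized response, values at or below a lower bound get desirability 0, values at or above the target get 1, and values in between are scaled by a power `r`; individual desirabilities are then combined with a weighted geometric mean. The `low`/`target` bounds below are illustrative assumptions, not the tool's actual settings:

```python
def desirability_max(y, low, target, r=1.0):
    """Derringer-Suich one-sided desirability for a maximized response."""
    if y <= low:
        return 0.0
    if y >= target:
        return 1.0
    return ((y - low) / (target - low)) ** r

def overall_desirability(d_values, weights):
    """Weighted geometric mean of individual desirabilities."""
    prod = 1.0
    for d, w in zip(d_values, weights):
        prod *= d ** w
    return prod ** (1.0 / sum(weights))

# Midpoint of an assumed [300, 500] band scores 0.5 with linear scaling (r=1).
print(desirability_max(400.0, low=300.0, target=500.0))  # 0.5

# Combining the report's per-response desirabilities with its 1.5/1.5 weights
# reproduces the overall D (up to rounding of the printed inputs).
D = overall_desirability([0.7541, 0.9545], [1.5, 1.5])
```

Because the combination is a geometric mean, any single response with desirability 0 drives the overall D to 0, which is what forces a genuine compromise between competing responses.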

Overall Desirability
D = 0.8485

Per-Response Desirability

Response    Weight   Desirability   Predicted       Direction
gflops      1.5      0.7541         593.50 GFLOPS   ↑ maximize
occupancy   1.5      0.9545         77.00 %         ↑ maximize

Recommended Settings

Factor          Value
block_size      128 threads
shared_mem      48 KB
unroll_factor   8
precision       fp64

Source: from observed run #10

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response    Predicted   Best Observed   Sacrifice
gflops      593.50      705.90          +112.40
occupancy   77.00       77.00           +0.00

Top 3 Runs by Desirability

Run    D        Factor Settings
#10    0.8485   block_size=128, shared_mem=48, unroll_factor=8, precision=fp64
#6     0.7610   block_size=512, shared_mem=16, unroll_factor=8, precision=fp64
#15    0.7222   block_size=128, shared_mem=16, unroll_factor=2, precision=fp64

Model Quality

Response    R²       Type
gflops      0.4499   linear
occupancy   0.2311   linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.8485

Response        Weight   Desirability   Predicted        Direction
---------------------------------------------------------------------
gflops             1.5         0.7541   593.50 GFLOPS    ↑
occupancy          1.5         0.9545   77.00 %          ↑

Recommended settings:
  block_size = 128 threads
  shared_mem = 48 KB
  unroll_factor = 8
  precision = fp64
  (from observed run #10)

Trade-off summary:
  gflops: 593.50 (best observed: 705.90, sacrifice: +112.40)
  occupancy: 77.00 (best observed: 77.00, sacrifice: +0.00)

Model quality:
  gflops: R² = 0.4499 (linear)
  occupancy: R² = 0.2311 (linear)

Top 3 observed runs by overall desirability:
  1. Run #10 (D=0.8485): block_size=128, shared_mem=48, unroll_factor=8, precision=fp64
  2. Run #6 (D=0.7610): block_size=512, shared_mem=16, unroll_factor=8, precision=fp64
  3. Run #15 (D=0.7222): block_size=128, shared_mem=16, unroll_factor=2, precision=fp64