Full Factorial

GPU Kernel Optimization

Optimize CUDA kernel launch parameters for maximum throughput on NVIDIA GPUs.

Summary

This experiment investigates GPU kernel optimization: a full factorial design to tune CUDA kernel launch parameters for maximum throughput and occupancy.

The design varies 4 factors: block size (128 to 512 threads), shared memory (16 to 48 KB), unroll factor (2 to 8), and precision (fp32 or fp64, categorical). The goal is to maximize 2 responses: gflops (GFLOPS) and occupancy (%). Fixed conditions held constant across all runs: gpu model = A100, problem size = 8192.

A full factorial design was used to explore all 2⁴ = 16 combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (16 runs).
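Enumerating a two-level full factorial is just the Cartesian product of the factor levels. A minimal sketch (the actual run order is produced by `doe generate` from the config, and may be randomized):

```python
from itertools import product

# Two levels per factor, as listed in the design above.
levels = {
    "block_size": [128, 512],       # threads
    "shared_mem": [16, 48],         # KB
    "unroll_factor": [2, 8],
    "precision": ["fp32", "fp64"],  # categorical
}

# Full factorial: every combination of factor levels appears exactly once.
runs = [dict(zip(levels, combo)) for combo in product(*levels.values())]
print(len(runs))  # 2**4 = 16 runs
```

The product grows multiplicatively with each added factor, which is why full factorials are usually reserved for small factor counts like the 4 used here.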

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
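A quadratic RSM is an ordinary least-squares fit of the response against linear, interaction, and squared terms in the coded factors. The sketch below uses a toy 3-level grid with made-up response values (pure quadratic terms cannot be estimated from two levels alone), not the actual run data:

```python
import numpy as np

# Toy data: two coded factors x1, x2 in [-1, 1] on a 3x3 grid;
# y is an illustrative measured response (e.g. GFLOPS).
x1, x2 = np.meshgrid([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
x1, x2 = x1.ravel(), x2.ravel()
y = np.array([420, 500, 590, 450, 560, 540, 440, 520, 510], dtype=float)

# Quadratic RSM: y ~ b0 + b1*x1 + b2*x2 + b12*x1*x2 + b11*x1^2 + b22*x2^2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the fitted surface on a fine grid (other factors held at center)
# -- this is what the contour/surface plots below visualize.
g1, g2 = np.meshgrid(np.linspace(-1, 1, 25), np.linspace(-1, 1, 25))
surface = (coef[0] + coef[1] * g1 + coef[2] * g2
           + coef[3] * g1 * g2 + coef[4] * g1**2 + coef[5] * g2**2)
```

The sign of the squared coefficients tells you whether the surface is domed (interior maximum) or bowl-shaped, and the interaction coefficient produces the twisted "saddle" shapes discussed later.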

Key Findings

For gflops, the most influential factors were block size (41.5%), precision (28.0%), and shared mem (27.2%). The best observed value was 705.9 GFLOPS (at block size = 128, shared mem = 16, unroll factor = 2).

For occupancy, the most influential factors were shared mem (39.4%), block size (25.9%), and precision (22.7%). The best observed value was 77.0% (at block size = 128, shared mem = 48, unroll factor = 2).

Recommended Next Steps

Experimental Setup

Factors

Factor          Levels       Type          Unit
block_size      128, 512     continuous    threads
shared_mem      16, 48       continuous    KB
unroll_factor   2, 8         continuous
precision       fp32, fp64   categorical

Fixed: gpu_model=A100, problem_size=8192

Responses

Response    Direction    Unit
gflops      ↑ maximize   GFLOPS
occupancy   ↑ maximize   %

Experimental Matrix

The Full Factorial Design produces 16 runs. Each row is one experiment with specific factor settings.

Run   block_size   shared_mem   unroll_factor   precision
1     128          48           8               fp64
2     512          16           2               fp64
3     128          48           2               fp64
4     128          48           8               fp32
5     512          48           8               fp32
6     512          16           8               fp32
7     512          48           2               fp32
8     512          16           2               fp32
9     128          16           2               fp64
10    128          16           8               fp32
11    512          48           2               fp64
12    512          48           8               fp64
13    128          48           2               fp32
14    512          16           8               fp64
15    128          16           2               fp32
16    128          16           8               fp64

How to Run

terminal
$ doe info --config use_cases/08_gpu_kernel_optimization/config.json
$ doe generate --config use_cases/08_gpu_kernel_optimization/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/08_gpu_kernel_optimization/config.json
$ doe optimize --config use_cases/08_gpu_kernel_optimization/config.json
$ doe report --config use_cases/08_gpu_kernel_optimization/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: gflops

Pareto Chart

Pareto chart for gflops

Main Effects Plot

Main effects plot for gflops

Response: occupancy

Pareto Chart

Pareto chart for occupancy

Main Effects Plot

Main effects plot for occupancy

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.


How to Read These Surfaces

Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.

  • Flat surface — these two factors have little effect on the response.
  • Tilted plane — strong linear effect; moving along one axis consistently changes the response.
  • Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
  • Saddle shape — significant interaction; the best setting of one factor depends on the other.
  • Red dots far from surface — poor model fit in that region; be cautious about predictions there.

gflops (GFLOPS) — R² = 0.941, Adj R² = 0.118
The raw R² is high, but the much lower adjusted R² suggests the quadratic model uses nearly as many terms as there are runs; treat the surface shape as indicative rather than reliable.
Curvature detected in block_size, shared_mem — look for a peak or valley in the surface.
Strongest linear driver: unroll_factor (higher unroll decreases gflops, consistent with the main-effects table below).
Notable interaction: block_size × precision — the effect of one depends on the level of the other. Look for a twisted surface.

occupancy (%) — R² = 0.775, Adj R² = -2.381
The negative adjusted R² indicates overfitting: the surface shows broad trends, but its quantitative predictions should not be trusted.
Curvature detected in unroll_factor, block_size — look for a peak or valley in the surface.
Strongest linear driver: shared_mem (decreases occupancy).
Notable interaction: block_size × shared_mem — the effect of one depends on the level of the other. Look for a twisted surface.

gflops: block size vs shared mem

RSM surface: gflops — block size vs shared mem

gflops: block size vs unroll factor

RSM surface: gflops — block size vs unroll factor

gflops: shared mem vs unroll factor

RSM surface: gflops — shared mem vs unroll factor

occupancy: block size vs shared mem

RSM surface: occupancy — block size vs shared mem

occupancy: block size vs unroll factor

RSM surface: occupancy — block size vs unroll factor

occupancy: shared mem vs unroll factor

RSM surface: occupancy — shared mem vs unroll factor

Full Analysis Output

doe analyze
=== Main Effects: gflops ===
Factor              Effect     Std Error   % Contribution
--------------------------------------------------------------
unroll_factor    -117.3000       38.1811            71.4%
shared_mem        -19.9000       38.1811            12.1%
precision         -16.7000       38.1811            10.2%
block_size         10.3250       38.1811             6.3%

=== Interaction Effects: gflops ===
Factor A        Factor B        Interaction   % Contribution
------------------------------------------------------------------------
block_size      precision         -134.9000            36.4%
shared_mem      unroll_factor       98.8750            26.7%
block_size      shared_mem         -80.6500            21.8%
block_size      unroll_factor       27.4000             7.4%
unroll_factor   precision          -25.0750             6.8%
shared_mem      precision           -3.5750             1.0%

=== Summary Statistics: gflops ===
block_size:
Level       N       Mean        Std        Min        Max
------------------------------------------------------------
128         8   433.5250   178.1231   196.0000   705.9000
512         8   443.8500   134.8807   259.0000   593.5000
shared_mem:
Level       N       Mean        Std        Min        Max
------------------------------------------------------------
16          8   448.6375   179.4997   196.0000   705.9000
48          8   428.7375   132.4202   252.2000   620.9000
unroll_factor:
Level       N       Mean        Std        Min        Max
------------------------------------------------------------
2           8   497.3375   145.8686   259.0000   705.9000
8           8   380.0375   144.3657   196.0000   620.9000
precision:
Level       N       Mean        Std        Min        Max
------------------------------------------------------------
fp32        8   447.0375   115.6024   252.2000   592.4000
fp64        8   430.3375   190.9405   196.0000   705.9000

=== Main Effects: occupancy ===
Factor              Effect   Std Error   % Contribution
--------------------------------------------------------------
shared_mem         -8.6250      3.9148            43.8%
precision          -5.5750      3.9148            28.3%
block_size         -2.7500      3.9148            14.0%
unroll_factor      -2.7250      3.9148            13.9%

=== Interaction Effects: occupancy ===
Factor A        Factor B        Interaction   % Contribution
------------------------------------------------------------------------
shared_mem      precision          -10.1500            32.7%
shared_mem      unroll_factor        8.2500            26.6%
unroll_factor   precision           -6.1500            19.8%
block_size      unroll_factor       -4.0250            13.0%
block_size      precision           -2.1250             6.8%
block_size      shared_mem           0.3250             1.0%

=== Summary Statistics: occupancy ===
block_size:
Level       N      Mean       Std       Min       Max
------------------------------------------------------------
128         8   52.4750   13.5118   28.5000   73.2000
512         8   49.7250   18.3997   28.4000   77.0000
shared_mem:
Level       N      Mean       Std       Min       Max
------------------------------------------------------------
16          8   55.4125   13.9524   35.0000   77.0000
48          8   46.7875   16.9783   28.4000   72.0000
unroll_factor:
Level       N      Mean       Std       Min       Max
------------------------------------------------------------
2           8   52.4625   20.4763   28.4000   77.0000
8           8   49.7375   10.0955   29.1000   65.3000
precision:
Level       N      Mean       Std       Min       Max
------------------------------------------------------------
fp32        8   53.8875   16.2604   28.5000   73.2000
fp64        8   48.3125   15.5974   28.4000   77.0000
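The main-effect numbers above can be checked by hand: each effect is the mean response at the factor's high level minus the mean at its low level, and the % contribution column is consistent with normalizing the absolute effects (a sketch inferred from the printed numbers, not necessarily the tool's exact formula):

```python
# Effect of a two-level factor = mean(high-level responses) - mean(low-level responses).
# Level means taken from the gflops summary statistics above.
effect_unroll = 380.0375 - 497.3375   # unroll_factor 8 vs 2 -> -117.3

# % contribution appears to be |effect| normalized by the sum of |effects|.
effects = {
    "unroll_factor": -117.300,
    "shared_mem":     -19.900,
    "precision":      -16.700,
    "block_size":      10.325,
}
total = sum(abs(e) for e in effects.values())
contribution = {name: 100 * abs(e) / total for name, e in effects.items()}
# contribution["unroll_factor"] is ~71.4, matching the table
```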

Optimization Recommendations

doe optimize
=== Optimization: gflops ===
Direction: maximize

Best observed run: #6
  block_size = 512
  shared_mem = 16
  unroll_factor = 2
  precision = fp32
  Value: 705.9

RSM Model (linear, R² = 0.30):
  Coefficients:
    intercept: +438.6875
    block_size: +0.1250
    shared_mem: -69.4875
    unroll_factor: -13.3125
    precision: -40.0625

Predicted optimum:
  block_size = 512
  shared_mem = 16
  unroll_factor = 2
  precision = fp32
  Predicted value: 561.6750

Factor importance:
  1. shared_mem (effect: -139.0, contribution: 56.5%)
  2. precision (effect: -80.1, contribution: 32.6%)
  3. unroll_factor (effect: -26.6, contribution: 10.8%)
  4. block_size (effect: 0.2, contribution: 0.1%)

=== Optimization: occupancy ===
Direction: maximize

Best observed run: #10
  block_size = 128
  shared_mem = 16
  unroll_factor = 2
  precision = fp32
  Value: 77.0

RSM Model (linear, R² = 0.48):
  Coefficients:
    intercept: +51.1000
    block_size: -4.9875
    shared_mem: -6.6000
    unroll_factor: -4.4750
    precision: -4.7250

Predicted optimum:
  block_size = 128
  shared_mem = 16
  unroll_factor = 2
  precision = fp32
  Predicted value: 71.8875

Factor importance:
  1. shared_mem (effect: -13.2, contribution: 31.7%)
  2. block_size (effect: -10.0, contribution: 24.0%)
  3. precision (effect: -9.5, contribution: 22.7%)
  4. unroll_factor (effect: -9.0, contribution: 21.5%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
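The desirability math is small enough to sketch directly. For a maximized response, values at or below a lower bound get desirability 0, values at or above the target get 1, and values in between are scaled by a power `r`; individual desirabilities are then combined with a weighted geometric mean. The `low`/`target` bounds below are illustrative assumptions, not the tool's actual settings:

```python
def desirability_max(y, low, target, r=1.0):
    """Derringer-Suich one-sided desirability for a maximized response."""
    if y <= low:
        return 0.0
    if y >= target:
        return 1.0
    return ((y - low) / (target - low)) ** r

def overall_desirability(d_values, weights):
    """Weighted geometric mean of individual desirabilities."""
    prod = 1.0
    for d, w in zip(d_values, weights):
        prod *= d ** w
    return prod ** (1.0 / sum(weights))

# Midpoint of an assumed [300, 500] band scores 0.5 with linear scaling (r=1).
print(desirability_max(400.0, low=300.0, target=500.0))  # 0.5

# Combining the report's per-response desirabilities with its 1.5/1.5 weights
# reproduces the overall D (up to rounding of the printed inputs).
D = overall_desirability([0.7541, 0.9545], [1.5, 1.5])
```

Because the combination is a geometric mean, any single response with desirability 0 drives the overall D to 0, which is what forces a genuine compromise between competing responses.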

Overall Desirability
D = 0.8485

Per-Response Desirability

Response    Weight   Desirability   Predicted       Direction
gflops      1.5      0.7541         593.50 GFLOPS   ↑ maximize
occupancy   1.5      0.9545         77.00 %         ↑ maximize

Recommended Settings

Factor          Value
block_size      128 threads
shared_mem      48 KB
unroll_factor   8
precision       fp64

Source: from observed run #10

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response    Predicted   Best Observed   Sacrifice
gflops      593.50      705.90          +112.40
occupancy   77.00       77.00           +0.00

Top 3 Runs by Desirability

Run    D        Factor Settings
#10    0.8485   block_size=128, shared_mem=48, unroll_factor=8, precision=fp64
#6     0.7610   block_size=512, shared_mem=16, unroll_factor=8, precision=fp64
#15    0.7222   block_size=128, shared_mem=16, unroll_factor=2, precision=fp64

Model Quality

Response    R²       Type
gflops      0.4499   linear
occupancy   0.2311   linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.8485

Response        Weight   Desirability   Predicted        Direction
---------------------------------------------------------------------
gflops             1.5         0.7541   593.50 GFLOPS    ↑
occupancy          1.5         0.9545   77.00 %          ↑

Recommended settings:
  block_size = 128 threads
  shared_mem = 48 KB
  unroll_factor = 8
  precision = fp64
  (from observed run #10)

Trade-off summary:
  gflops: 593.50 (best observed: 705.90, sacrifice: +112.40)
  occupancy: 77.00 (best observed: 77.00, sacrifice: +0.00)

Model quality:
  gflops: R² = 0.4499 (linear)
  occupancy: R² = 0.2311 (linear)

Top 3 observed runs by overall desirability:
  1. Run #10 (D=0.8485): block_size=128, shared_mem=48, unroll_factor=8, precision=fp64
  2. Run #6 (D=0.7610): block_size=512, shared_mem=16, unroll_factor=8, precision=fp64
  3. Run #15 (D=0.7222): block_size=128, shared_mem=16, unroll_factor=2, precision=fp64