Summary
This experiment investigates GPU kernel optimization, using a full factorial design to tune kernel launch parameters for maximum throughput and occupancy.
The design varies four factors: block size (128–512 threads), shared memory (16–48 KB), unroll factor (2–8), and precision (fp32 or fp64). The goal is to maximize two responses: gflops (GFLOPS) and occupancy (%). Fixed conditions held constant across all runs: gpu model = A100, problem size = 8192.
A full factorial design was used to explore all 16 possible combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (16 runs).
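As a minimal sketch, the full set of two-level combinations can be enumerated as a Cartesian product of the factor levels. The factor names and levels below come from this study; the code is illustrative, not the `doe` tool's actual implementation.

```python
from itertools import product

# Factor levels from this study's design space.
levels = {
    "block_size": [128, 512],       # threads
    "shared_mem": [16, 48],         # KB
    "unroll_factor": [2, 8],
    "precision": ["fp32", "fp64"],
}

# Cartesian product of all levels: every combination appears exactly once.
runs = [dict(zip(levels, combo)) for combo in product(*levels.values())]
print(len(runs))  # 2^4 = 16 runs
```

Because every combination is present, each main effect and each interaction can be estimated from a balanced contrast over all 16 runs.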
Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
Key Findings
For gflops, the most influential factors were block size (41.5%), precision (28.0%), shared mem (27.2%). The best observed value was 705.9 (at block size = 128, shared mem = 16, unroll factor = 2).
For occupancy, the most influential factors were shared mem (39.4%), block size (25.9%), precision (22.7%). The best observed value was 77.0 (at block size = 128, shared mem = 48, unroll factor = 2).
Recommended Next Steps
- Consider whether any fixed factors should be varied in a future study.
Experimental Setup
Factors
| Factor | Levels | Type | Unit |
|---|---|---|---|
| block_size | 128, 512 | continuous | threads |
| shared_mem | 16, 48 | continuous | KB |
| unroll_factor | 2, 8 | continuous | |
| precision | fp32, fp64 | categorical | |
Fixed: gpu_model=A100, problem_size=8192
Responses
| Response | Direction | Unit |
|---|---|---|
| gflops | ↑ maximize | GFLOPS |
| occupancy | ↑ maximize | % |
Experimental Matrix
The Full Factorial Design produces 16 runs. Each row is one experiment with specific factor settings.
| Run | block_size | shared_mem | unroll_factor | precision |
|---|---|---|---|---|
| 1 | 128 | 48 | 8 | fp64 |
| 2 | 512 | 16 | 2 | fp64 |
| 3 | 128 | 48 | 2 | fp64 |
| 4 | 128 | 48 | 8 | fp32 |
| 5 | 512 | 48 | 8 | fp32 |
| 6 | 512 | 16 | 8 | fp32 |
| 7 | 512 | 48 | 2 | fp32 |
| 8 | 512 | 16 | 2 | fp32 |
| 9 | 128 | 16 | 2 | fp64 |
| 10 | 128 | 16 | 8 | fp32 |
| 11 | 512 | 48 | 2 | fp64 |
| 12 | 512 | 48 | 8 | fp64 |
| 13 | 128 | 48 | 2 | fp32 |
| 14 | 512 | 16 | 8 | fp64 |
| 15 | 128 | 16 | 2 | fp32 |
| 16 | 128 | 16 | 8 | fp64 |
How to Run
```shell
$ doe info --config use_cases/08_gpu_kernel_optimization/config.json
$ doe generate --config use_cases/08_gpu_kernel_optimization/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/08_gpu_kernel_optimization/config.json
$ doe optimize --config use_cases/08_gpu_kernel_optimization/config.json
$ doe report --config use_cases/08_gpu_kernel_optimization/config.json --output report.html
```
Analysis Results
Generated from actual experiment runs.
Response: gflops
Pareto Chart
Main Effects Plot
Response: occupancy
Pareto Chart
Main Effects Plot
Response Surface Plots
3D surfaces fitted with quadratic RSM. Red dots are observed data points.
How to Read These Surfaces
Each plot shows predicted response (vertical axis) across two factors while other factors are held at center. Red dots are actual experimental observations.
- Flat surface — these two factors have little effect on the response.
- Tilted plane — strong linear effect; moving along one axis consistently changes the response.
- Curved/domed surface — quadratic curvature; there is an optimum somewhere in the middle.
- Saddle shape — significant interaction; the best setting of one factor depends on the other.
- Red dots far from surface — poor model fit in that region; be cautious about predictions there.
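A quadratic response surface of this kind can be sketched as an ordinary least-squares fit over linear, interaction, and pure-quadratic terms. The data below is synthetic and the code is illustrative, not the tool's actual RSM fitting routine; it only shows where the "curvature" and "twisted surface" terms live in the model.

```python
import numpy as np

# Two coded factors (-1/+1 corners plus center points) and a synthetic response.
rng = np.random.default_rng(0)
x1 = np.array([-1, -1, 1, 1, 0, 0, -1, 1, 0])
x2 = np.array([-1, 1, -1, 1, 0, 0, 0, 0, -1])
y = 400 + 30 * x1 - 20 * x2 - 15 * x1 * x2 - 10 * x1**2 + rng.normal(0, 1, x1.size)

# Design matrix: intercept, linear, interaction, and pure quadratic columns.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# A nonzero x1*x2 coefficient is the "twisted surface" interaction;
# nonzero squared-term coefficients are the curvature (peak or valley).
print(np.round(beta, 1))
```

Note that fitting six coefficients per factor pair eats degrees of freedom quickly, which is why a 16-run design can show a high raw R² alongside a much lower adjusted R².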
gflops (GFLOPS) — R² = 0.941, Adj R² = 0.118
The raw R² is high, but the much lower adjusted R² (0.118) suggests the quadratic model is overparameterized for 16 runs; treat the surface shape as indicative rather than definitive.
Curvature detected in block_size, shared_mem — look for a peak or valley in the surface.
Strongest linear driver: unroll_factor (increases gflops).
Notable interaction: block_size × precision — the effect of one depends on the level of the other. Look for a twisted surface.
occupancy (%) — R² = 0.775, Adj R² = -2.381
The negative adjusted R² (-2.381) indicates the quadratic model is badly overfit for this response; the surface may show broad trends, but individual predictions should not be trusted.
Curvature detected in unroll_factor, block_size — look for a peak or valley in the surface.
Strongest linear driver: shared_mem (decreases occupancy).
Notable interaction: block_size × shared_mem — the effect of one depends on the level of the other. Look for a twisted surface.
gflops: block size vs shared mem
gflops: block size vs unroll factor
gflops: shared mem vs unroll factor
occupancy: block size vs shared mem
occupancy: block size vs unroll factor
occupancy: shared mem vs unroll factor
Full Analysis Output
=== Main Effects: gflops ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
unroll_factor -117.3000 38.1811 71.4%
shared_mem -19.9000 38.1811 12.1%
precision -16.7000 38.1811 10.2%
block_size 10.3250 38.1811 6.3%
=== Interaction Effects: gflops ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
block_size precision -134.9000 36.4%
shared_mem unroll_factor 98.8750 26.7%
block_size shared_mem -80.6500 21.8%
block_size unroll_factor 27.4000 7.4%
unroll_factor precision -25.0750 6.8%
shared_mem precision -3.5750 1.0%
=== Summary Statistics: gflops ===
block_size:
Level N Mean Std Min Max
------------------------------------------------------------
128 8 433.5250 178.1231 196.0000 705.9000
512 8 443.8500 134.8807 259.0000 593.5000
shared_mem:
Level N Mean Std Min Max
------------------------------------------------------------
16 8 448.6375 179.4997 196.0000 705.9000
48 8 428.7375 132.4202 252.2000 620.9000
unroll_factor:
Level N Mean Std Min Max
------------------------------------------------------------
2 8 497.3375 145.8686 259.0000 705.9000
8 8 380.0375 144.3657 196.0000 620.9000
precision:
Level N Mean Std Min Max
------------------------------------------------------------
fp32 8 447.0375 115.6024 252.2000 592.4000
fp64 8 430.3375 190.9405 196.0000 705.9000
=== Main Effects: occupancy ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
shared_mem -8.6250 3.9148 43.8%
precision -5.5750 3.9148 28.3%
block_size -2.7500 3.9148 14.0%
unroll_factor -2.7250 3.9148 13.9%
=== Interaction Effects: occupancy ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
shared_mem precision -10.1500 32.7%
shared_mem unroll_factor 8.2500 26.6%
unroll_factor precision -6.1500 19.8%
block_size unroll_factor -4.0250 13.0%
block_size precision -2.1250 6.8%
block_size shared_mem 0.3250 1.0%
=== Summary Statistics: occupancy ===
block_size:
Level N Mean Std Min Max
------------------------------------------------------------
128 8 52.4750 13.5118 28.5000 73.2000
512 8 49.7250 18.3997 28.4000 77.0000
shared_mem:
Level N Mean Std Min Max
------------------------------------------------------------
16 8 55.4125 13.9524 35.0000 77.0000
48 8 46.7875 16.9783 28.4000 72.0000
unroll_factor:
Level N Mean Std Min Max
------------------------------------------------------------
2 8 52.4625 20.4763 28.4000 77.0000
8 8 49.7375 10.0955 29.1000 65.3000
precision:
Level N Mean Std Min Max
------------------------------------------------------------
fp32 8 53.8875 16.2604 28.5000 73.2000
fp64 8 48.3125 15.5974 28.4000 77.0000
Optimization Recommendations
=== Optimization: gflops ===
Direction: maximize
Best observed run: #6
block_size = 512
shared_mem = 16
unroll_factor = 2
precision = fp32
Value: 705.9
RSM Model (linear, R² = 0.30):
Coefficients:
intercept: +438.6875
block_size: +0.1250
shared_mem: -69.4875
unroll_factor: -13.3125
precision: -40.0625
Predicted optimum:
block_size = 512
shared_mem = 16
unroll_factor = 2
precision = fp32
Predicted value: 561.6750
Factor importance:
1. shared_mem (effect: -139.0, contribution: 56.5%)
2. precision (effect: -80.1, contribution: 32.6%)
3. unroll_factor (effect: -26.6, contribution: 10.8%)
4. block_size (effect: 0.2, contribution: 0.1%)
=== Optimization: occupancy ===
Direction: maximize
Best observed run: #10
block_size = 128
shared_mem = 16
unroll_factor = 2
precision = fp32
Value: 77.0
RSM Model (linear, R² = 0.48):
Coefficients:
intercept: +51.1000
block_size: -4.9875
shared_mem: -6.6000
unroll_factor: -4.4750
precision: -4.7250
Predicted optimum:
block_size = 128
shared_mem = 16
unroll_factor = 2
precision = fp32
Predicted value: 71.8875
Factor importance:
1. shared_mem (effect: -13.2, contribution: 31.7%)
2. block_size (effect: -10.0, contribution: 24.0%)
3. precision (effect: -9.5, contribution: 22.7%)
4. unroll_factor (effect: -9.0, contribution: 21.5%)
Multi-Objective Optimization
When responses compete, Derringer–Suich desirability finds the best compromise.
Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
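The scaling and combination steps can be sketched as follows. The bounds below are taken from the observed ranges in this report's summary statistics, but the exact scaling the tool uses (linear ramps, targets, shape parameters) is an assumption; this is an illustration of the method, not the tool's implementation.

```python
import math

def desirability_max(y, lo, hi):
    """Scale a maximize-response to [0, 1]: 0 at/below lo, 1 at/above hi, linear between."""
    return min(max((y - lo) / (hi - lo), 0.0), 1.0)

def overall(ds, weights):
    """Weighted geometric mean of per-response desirabilities (Derringer-Suich)."""
    total = sum(weights)
    return math.prod(d ** (w / total) for d, w in zip(ds, weights))

# Observed ranges from this report; assumed linear ramps over those ranges.
d_gflops = desirability_max(593.5, lo=196.0, hi=705.9)
d_occ = desirability_max(77.0, lo=28.4, hi=77.0)
D = overall([d_gflops, d_occ], weights=[1.5, 1.5])
```

The geometric mean (rather than an arithmetic mean) means any single response with desirability 0 drives the overall D to 0, so no response can be completely sacrificed.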
Overall Desirability
D = 0.8485
Per-Response Desirability
| Response | Weight | Desirability | Predicted | Dir |
|---|---|---|---|---|
| gflops | 1.5 | 0.7541 | 593.50 GFLOPS | ↑ |
| occupancy | 1.5 | 0.9545 | 77.00 % | ↑ |
Recommended Settings
| Factor | Value |
|---|---|
| block_size | 128 threads |
| shared_mem | 48 KB |
| unroll_factor | 8 |
| precision | fp64 |
Source: from observed run #10
Trade-off Summary
Sacrifice = how much worse than single-objective best.
| Response | Predicted | Best Observed | Sacrifice |
|---|---|---|---|
| gflops | 593.50 | 705.90 | +112.40 |
| occupancy | 77.00 | 77.00 | +0.00 |
Top 3 Runs by Desirability
| Run | D | Factor Settings |
|---|---|---|
| #10 | 0.8485 | block_size=128, shared_mem=48, unroll_factor=8, precision=fp64 |
| #6 | 0.7610 | block_size=512, shared_mem=16, unroll_factor=8, precision=fp64 |
| #15 | 0.7222 | block_size=128, shared_mem=16, unroll_factor=2, precision=fp64 |
Model Quality
| Response | R² | Type |
|---|---|---|
| gflops | 0.4499 | linear |
| occupancy | 0.2311 | linear |
Full Multi-Objective Output
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================
Overall desirability: D = 0.8485
Response Weight Desirability Predicted Direction
---------------------------------------------------------------------
gflops 1.5 0.7541 593.50 GFLOPS ↑
occupancy 1.5 0.9545 77.00 % ↑
Recommended settings:
block_size = 128 threads
shared_mem = 48 KB
unroll_factor = 8
precision = fp64
(from observed run #10)
Trade-off summary:
gflops: 593.50 (best observed: 705.90, sacrifice: +112.40)
occupancy: 77.00 (best observed: 77.00, sacrifice: +0.00)
Model quality:
gflops: R² = 0.4499 (linear)
occupancy: R² = 0.2311 (linear)
Top 3 observed runs by overall desirability:
1. Run #10 (D=0.8485): block_size=128, shared_mem=48, unroll_factor=8, precision=fp64
2. Run #6 (D=0.7610): block_size=512, shared_mem=16, unroll_factor=8, precision=fp64
3. Run #15 (D=0.7222): block_size=128, shared_mem=16, unroll_factor=2, precision=fp64