Summary
This experiment investigates hardware and software prefetch tuning. It uses a Plackett-Burman screening design to identify the critical prefetch parameters for memory-bound CFD workloads.
The design varies 6 factors: hw prefetcher (off or on), sw prefetch dist (64 to 512 bytes), l2 stream detect (off or on), dcache policy (write_back or write_through), prefetch threads (0 to 2), and tlb prefetch (off or on). The goal is to minimize 2 responses: solver time sec (seconds) and cache miss rate (%). Conditions held constant across all runs are processor = Xeon_8490H, workload = CFD_structured_grid, grid points = 256M, and iterations = 100.
A Plackett-Burman screening design was used to test 6 factors in 16 runs: 8 distinct factor combinations, each run once in each of 2 blocks. This design assumes interactions are negligible and focuses on identifying the most influential main effects.
Key Findings
For solver time sec, the factors with the largest main effects were sw prefetch dist (24.6%), l2 stream detect (21.5%), and prefetch threads (16.4%). The best observed value was 50.73 (at hw prefetcher = on, sw prefetch dist = 512, l2 stream detect = on).
For cache miss rate, the factors with the largest main effects were tlb prefetch (26.7%), hw prefetcher (25.3%), and dcache policy (23.7%). The best observed value was 1.62 (at hw prefetcher = on, sw prefetch dist = 512, l2 stream detect = on).
Recommended Next Steps
- Follow up with a response surface design (CCD or Box-Behnken) on the top 3–4 factors to model curvature and find the true optimum.
- Consider whether any fixed factors should be varied in a future study.
- The screening results can guide factor reduction: drop factors contributing less than 5% and re-run with a smaller, more focused design.
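As an illustration of the response surface follow-up, the sketch below builds a face-centered central composite design (CCD with alpha = 1) for the two continuous factors, sw_prefetch_dist and prefetch_threads, using the ranges from this study. The choice of these two factors, the face-centered variant, and the helper name `face_centered_ccd` are assumptions made for the example; the face-centered form is used here because it keeps prefetch_threads at the integer values 0, 1, and 2.

```python
from itertools import product


def face_centered_ccd(ranges):
    """Return face-centered CCD points (alpha = 1) in natural units.

    `ranges` maps each factor name to its (low, high) bounds.
    """
    names = list(ranges)
    centers = {n: (lo + hi) / 2 for n, (lo, hi) in ranges.items()}
    points = []
    # Factorial corners: every low/high combination.
    for corner in product(*(ranges[n] for n in names)):
        points.append({n: float(v) for n, v in zip(names, corner)})
    # Axial (face) points: one factor at an extreme, the others at center.
    for n in names:
        for extreme in ranges[n]:
            point = dict(centers)
            point[n] = float(extreme)
            points.append(point)
    # One center point; replicate it in practice to estimate pure error.
    points.append(dict(centers))
    return points


# Ranges taken from the factor table of this experiment.
ranges = {"sw_prefetch_dist": (64, 512), "prefetch_threads": (0, 2)}
design = face_centered_ccd(ranges)
for row in design:
    print(row)
```

For two factors this gives 9 design points (4 corners, 4 face points, 1 center), which is enough to fit a full quadratic model in the two factors.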
Experimental Setup
Factors
| Factor | Levels | Type | Unit |
| hw_prefetcher | off / on | categorical | — |
| sw_prefetch_dist | 64 – 512 | continuous | bytes |
| l2_stream_detect | off / on | categorical | — |
| dcache_policy | write_back / write_through | categorical | — |
| prefetch_threads | 0 – 2 | continuous | — |
| tlb_prefetch | off / on | categorical | — |
Fixed: processor = Xeon 8490H, workload = CFD structured grid, grid_points = 256M, iterations = 100
Responses
| Response | Direction | Unit |
| solver_time_sec | ↓ minimize | seconds |
| cache_miss_rate | ↓ minimize | % |
Experimental Matrix
The Plackett-Burman design produces 16 runs in 2 blocks. Each row is one experiment with specific factor settings.
| Run | Block | hw_prefetcher | sw_prefetch_dist | l2_stream_detect | dcache_policy | prefetch_threads | tlb_prefetch |
| 1 | 1 | on | 512 | on | write_back | 0 | off |
| 2 | 1 | off | 64 | on | write_through | 0 | off |
| 3 | 1 | off | 512 | off | write_through | 0 | on |
| 4 | 1 | on | 512 | on | write_through | 2 | on |
| 5 | 1 | off | 512 | off | write_back | 2 | off |
| 6 | 1 | on | 64 | off | write_through | 2 | off |
| 7 | 1 | off | 64 | on | write_back | 2 | on |
| 8 | 1 | on | 64 | off | write_back | 0 | on |
| 9 | 2 | on | 512 | on | write_back | 0 | off |
| 10 | 2 | on | 512 | on | write_through | 2 | on |
| 11 | 2 | off | 512 | off | write_back | 2 | off |
| 12 | 2 | off | 64 | on | write_back | 2 | on |
| 13 | 2 | off | 64 | on | write_through | 0 | off |
| 14 | 2 | off | 512 | off | write_through | 0 | on |
| 15 | 2 | on | 64 | off | write_through | 2 | off |
| 16 | 2 | on | 64 | off | write_back | 0 | on |
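Because the matrix contains only 8 distinct factor combinations, some two-factor interaction columns are identical to main-effect columns, so the two cannot be estimated separately. The following Python sketch checks this directly from runs 1 to 8 of the matrix above. The -1/+1 coding (on, 512, write_through, and 2 as +1) and the helper name `find_aliases` are choices made for this example, not something the doe tool is known to expose.

```python
from itertools import combinations

FACTORS = ["hw_prefetcher", "sw_prefetch_dist", "l2_stream_detect",
           "dcache_policy", "prefetch_threads", "tlb_prefetch"]
HIGH = {"on", "512", "write_through", "2"}  # levels coded as +1 (assumed coding)

# Runs 1-8 of the experimental matrix; block 2 repeats the same settings.
RUNS = """
on 512 on write_back 0 off
off 64 on write_through 0 off
off 512 off write_through 0 on
on 512 on write_through 2 on
off 512 off write_back 2 off
on 64 off write_through 2 off
off 64 on write_back 2 on
on 64 off write_back 0 on
""".strip().splitlines()

coded = [[1 if level in HIGH else -1 for level in line.split()] for line in RUNS]
columns = {name: [row[i] for row in coded] for i, name in enumerate(FACTORS)}


def find_aliases(columns):
    """List (A, B, main, sign) where the A*B column equals sign * main column."""
    found = []
    for a, b in combinations(columns, 2):
        product_col = [x * y for x, y in zip(columns[a], columns[b])]
        for main, col in columns.items():
            if main in (a, b):
                continue
            if product_col == col:
                found.append((a, b, main, 1))
            elif product_col == [-v for v in col]:
                found.append((a, b, main, -1))
    return found


aliases = find_aliases(columns)
for a, b, main, sign in aliases:
    print(f"{a} x {b} is aliased with {'+' if sign > 0 else '-'}{main}")
```

With this matrix, 12 of the 15 two-factor products coincide with a main-effect column. For example, hw_prefetcher x l2_stream_detect and dcache_policy x tlb_prefetch both reproduce the sw_prefetch_dist column, so a large sw_prefetch_dist effect could equally be caused by either interaction.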
How to Run
$ doe info --config use_cases/22_prefetch_strategy/config.json
$ doe generate --config use_cases/22_prefetch_strategy/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/22_prefetch_strategy/config.json
$ doe optimize --config use_cases/22_prefetch_strategy/config.json
$ doe report --config use_cases/22_prefetch_strategy/config.json --output report.html
Analysis Results
Generated from actual experiment runs.
Response: solver_time_sec
Pareto Chart
Main Effects Plot
Response: cache_miss_rate
Pareto Chart
Main Effects Plot
Response Surface Plots
3D surfaces fitted with quadratic RSM. Red dots are observed data points.
solver_time: sw_prefetch_dist vs prefetch_threads
cache_miss_rate: sw_prefetch_dist vs prefetch_threads
Full Analysis Output
=== Main Effects: solver_time_sec ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
sw_prefetch_dist 13.8375 4.4263 24.6%
l2_stream_detect -12.1125 4.4263 21.5%
prefetch_threads -9.2450 4.4263 16.4%
hw_prefetcher -8.4075 4.4263 14.9%
tlb_prefetch -7.6350 4.4263 13.6%
dcache_policy -5.0750 4.4263 9.0%
=== Interaction Effects: solver_time_sec ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
hw_prefetcher l2_stream_detect -13.8375 9.5%
dcache_policy tlb_prefetch -13.8375 9.5%
hw_prefetcher sw_prefetch_dist 12.1125 8.4%
prefetch_threads tlb_prefetch -12.1125 8.4%
hw_prefetcher tlb_prefetch -10.8100 7.5%
sw_prefetch_dist prefetch_threads 10.8100 7.5%
l2_stream_detect dcache_policy -10.8100 7.5%
hw_prefetcher dcache_policy -9.2450 6.4%
l2_stream_detect tlb_prefetch -9.2450 6.4%
sw_prefetch_dist l2_stream_detect 8.4075 5.8%
dcache_policy prefetch_threads -8.4075 5.8%
sw_prefetch_dist dcache_policy 7.6350 5.3%
l2_stream_detect prefetch_threads -7.6350 5.3%
hw_prefetcher prefetch_threads -5.0750 3.5%
sw_prefetch_dist tlb_prefetch 5.0750 3.5%
=== Summary Statistics: solver_time_sec ===
hw_prefetcher:
Level N Mean Std Min Max
------------------------------------------------------------
off 8 88.7550 13.1396 69.7300 110.5000
on 8 80.3475 21.4169 50.7300 109.7600
sw_prefetch_dist:
Level N Mean Std Min Max
------------------------------------------------------------
512 8 77.6325 17.8757 50.7300 102.8700
64 8 91.4700 15.5810 69.7300 110.5000
l2_stream_detect:
Level N Mean Std Min Max
------------------------------------------------------------
off 8 90.6075 12.8466 72.7600 109.7600
on 8 78.4950 20.5634 50.7300 110.5000
dcache_policy:
Level N Mean Std Min Max
------------------------------------------------------------
write_back 8 87.0888 10.9197 72.7600 109.7600
write_through 8 82.0138 23.1898 50.7300 110.5000
prefetch_threads:
Level N Mean Std Min Max
------------------------------------------------------------
0 8 89.1737 16.2978 69.7300 110.5000
2 8 79.9287 18.9014 50.7300 101.7600
tlb_prefetch:
Level N Mean Std Min Max
------------------------------------------------------------
off 8 88.3688 12.5660 69.7300 110.5000
on 8 80.7338 21.9205 50.7300 109.7600
=== Main Effects: cache_miss_rate ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
tlb_prefetch -2.1975 0.7905 26.7%
hw_prefetcher -2.0825 0.7905 25.3%
dcache_policy -1.9475 0.7905 23.7%
l2_stream_detect -1.1025 0.7905 13.4%
sw_prefetch_dist 0.4825 0.7905 5.9%
prefetch_threads -0.4125 0.7905 5.0%
=== Interaction Effects: cache_miss_rate ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
hw_prefetcher tlb_prefetch -3.0675 12.0%
sw_prefetch_dist prefetch_threads 3.0675 12.0%
l2_stream_detect dcache_policy -3.0675 12.0%
sw_prefetch_dist dcache_policy 2.1975 8.6%
l2_stream_detect prefetch_threads -2.1975 8.6%
sw_prefetch_dist l2_stream_detect 2.0825 8.1%
dcache_policy prefetch_threads -2.0825 8.1%
hw_prefetcher prefetch_threads -1.9475 7.6%
sw_prefetch_dist tlb_prefetch 1.9475 7.6%
hw_prefetcher sw_prefetch_dist 1.1025 4.3%
prefetch_threads tlb_prefetch -1.1025 4.3%
hw_prefetcher l2_stream_detect -0.4825 1.9%
dcache_policy tlb_prefetch -0.4825 1.9%
hw_prefetcher dcache_policy -0.4125 1.6%
l2_stream_detect tlb_prefetch -0.4125 1.6%
=== Summary Statistics: cache_miss_rate ===
hw_prefetcher:
Level N Mean Std Min Max
------------------------------------------------------------
off 8 8.5125 2.2260 3.4600 10.5400
on 8 6.4300 3.7405 1.6200 10.3200
sw_prefetch_dist:
Level N Mean Std Min Max
------------------------------------------------------------
512 8 7.2300 3.3916 1.6200 10.3200
64 8 7.7125 3.1287 2.2700 10.5400
l2_stream_detect:
Level N Mean Std Min Max
------------------------------------------------------------
off 8 8.0225 2.3541 2.2700 9.2700
on 8 6.9200 3.8973 1.6200 10.5400
dcache_policy:
Level N Mean Std Min Max
------------------------------------------------------------
write_back 8 8.4450 2.6124 2.2700 10.5400
write_through 8 6.4975 3.5260 1.6200 10.5400
prefetch_threads:
Level N Mean Std Min Max
------------------------------------------------------------
0 8 7.6775 3.0921 2.2700 10.5400
2 8 7.2650 3.4302 1.6200 10.5400
tlb_prefetch:
Level N Mean Std Min Max
------------------------------------------------------------
off 8 8.5700 2.2039 3.4600 10.5400
on 8 6.3725 3.7159 1.6200 10.5400
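The Effect and % Contribution columns in the main-effects tables above can be reproduced from the per-level means in the summary statistics. The Python sketch below does this for solver_time_sec. It assumes that an effect is the mean at the high level minus the mean at the low level, and that the contribution is the absolute effect divided by the sum of absolute effects. These formulas match the reported numbers, but they are inferred from the output rather than taken from the tool's source.

```python
# (mean at low level, mean at high level) of solver_time_sec,
# copied from the summary statistics above.
LEVEL_MEANS = {
    "hw_prefetcher": (88.7550, 80.3475),     # off, on
    "sw_prefetch_dist": (91.4700, 77.6325),  # 64, 512
    "l2_stream_detect": (90.6075, 78.4950),  # off, on
    "dcache_policy": (87.0888, 82.0138),     # write_back, write_through
    "prefetch_threads": (89.1737, 79.9287),  # 0, 2
    "tlb_prefetch": (88.3688, 80.7338),      # off, on
}


def main_effects(level_means):
    """Main effect = mean response at high level - mean at low level."""
    return {name: high - low for name, (low, high) in level_means.items()}


def contributions(effects):
    """Share of each |effect| in the sum of |effects|, in percent."""
    total = sum(abs(e) for e in effects.values())
    return {name: 100 * abs(e) / total for name, e in effects.items()}


effects = main_effects(LEVEL_MEANS)
percent = contributions(effects)
for name in sorted(effects, key=lambda n: -abs(effects[n])):
    print(f"{name:18s} {effects[name]:9.4f} {percent[name]:5.1f}%")
```

Computed this way, the sw_prefetch_dist effect is negative: going from 64 to 512 bytes lowers the mean solver time from 91.47 to 77.63. The table reports the same magnitude with a positive sign, and the summary statistics list level 512 before 64, which suggests (as an inference only) that the levels were ordered as text when the sign was assigned.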
Optimization Recommendations
=== Optimization: solver_time_sec ===
Direction: minimize
Best observed run: #10
hw_prefetcher = off
sw_prefetch_dist = 512
l2_stream_detect = off
dcache_policy = write_through
prefetch_threads = 0
tlb_prefetch = on
Value: 50.73
Factor importance:
1. dcache_policy (effect: -12.0, contribution: 27.1%)
2. prefetch_threads (effect: -11.0, contribution: 24.8%)
3. l2_stream_detect (effect: -8.8, contribution: 19.8%)
4. sw_prefetch_dist (effect: 8.3, contribution: 18.6%)
5. tlb_prefetch (effect: -2.5, contribution: 5.6%)
6. hw_prefetcher (effect: 1.8, contribution: 4.1%)
=== Optimization: cache_miss_rate ===
Direction: minimize
Best observed run: #4
hw_prefetcher = on
sw_prefetch_dist = 512
l2_stream_detect = on
dcache_policy = write_through
prefetch_threads = 2
tlb_prefetch = on
Value: 1.62
Factor importance:
1. dcache_policy (effect: -2.5, contribution: 38.1%)
2. prefetch_threads (effect: -1.9, contribution: 28.6%)
3. tlb_prefetch (effect: -1.2, contribution: 19.1%)
4. sw_prefetch_dist (effect: 0.7, contribution: 11.1%)
5. l2_stream_detect (effect: 0.1, contribution: 2.0%)
6. hw_prefetcher (effect: 0.1, contribution: 1.2%)
Multi-Objective Optimization
When responses compete, Derringer–Suich desirability finds the best compromise.
Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
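The two steps described above can be sketched in Python as follows. `desirability_minimize` is the standard smaller-is-better form with hypothetical `target` and `upper` bounds; the bounds the doe tool actually uses are not shown in this report, so only the combination step is checked against the reported numbers.

```python
import math


def desirability_minimize(y, target, upper, shape=1.0):
    """Smaller-is-better desirability: 1 at or below target, 0 at or above upper.

    `target`, `upper`, and `shape` are hypothetical parameters for illustration.
    """
    if y <= target:
        return 1.0
    if y >= upper:
        return 0.0
    return ((upper - y) / (upper - target)) ** shape


def overall_desirability(d, weights):
    """Weighted geometric mean of the per-response desirabilities."""
    if any(value == 0 for value in d.values()):
        return 0.0  # one unacceptable response makes the whole setting unacceptable
    total_weight = sum(weights[name] for name in d)
    log_sum = sum(weights[name] * math.log(d[name]) for name in d)
    return math.exp(log_sum / total_weight)


# Per-response desirabilities and weights as reported in this section.
d = {"solver_time_sec": 0.9466, "cache_miss_rate": 0.9545}
weights = {"solver_time_sec": 1.5, "cache_miss_rate": 1.0}
D = overall_desirability(d, weights)
print(f"D = {D:.4f}")
```

With the reported per-response values and weights, the weighted geometric mean comes out at about 0.9498, in line with the overall desirability stated in this section.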
Overall Desirability
D = 0.9498
Per-Response Desirability
| Response | Weight | Desirability | Predicted | Dir |
| solver_time_sec | 1.5 | 0.9466 | 51.25 seconds | ↓ |
| cache_miss_rate | 1.0 | 0.9545 | 1.62 % | ↓ |
Recommended Settings
| Factor | Value |
| hw_prefetcher | off |
| sw_prefetch_dist | 512 bytes |
| l2_stream_detect | off |
| dcache_policy | write_back |
| prefetch_threads | 2 |
| tlb_prefetch | off |
Source: from observed run #4
Trade-off Summary
Sacrifice = how much worse than single-objective best.
| Response | Predicted | Best Observed | Sacrifice |
| solver_time_sec | 51.25 | 50.73 | +0.52 |
| cache_miss_rate | 1.62 | 1.62 | +0.00 |
Top 3 Runs by Desirability
| Run | D | Factor Settings |
| #4 | 0.9498 | hw_prefetcher=off, sw_prefetch_dist=512, l2_stream_detect=off, dcache_policy=write_back, prefetch_threads=2, tlb_prefetch=off |
| #10 | 0.9376 | hw_prefetcher=on, sw_prefetch_dist=64, l2_stream_detect=off, dcache_policy=write_through, prefetch_threads=2, tlb_prefetch=off |
| #9 | 0.7155 | hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_back, prefetch_threads=0, tlb_prefetch=off |
Model Quality
| Response | R² | Type |
| solver_time_sec | 0.2763 | linear |
| cache_miss_rate | 0.2542 | linear |
Full Multi-Objective Output
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================
Overall desirability: D = 0.9498
Response Weight Desirability Predicted Direction
---------------------------------------------------------------------
solver_time_sec 1.5 0.9466 51.25 seconds ↓
cache_miss_rate 1.0 0.9545 1.62 % ↓
Recommended settings:
hw_prefetcher = off
sw_prefetch_dist = 512 bytes
l2_stream_detect = off
dcache_policy = write_back
prefetch_threads = 2
tlb_prefetch = off
(from observed run #4)
Trade-off summary:
solver_time_sec: 51.25 (best observed: 50.73, sacrifice: +0.52)
cache_miss_rate: 1.62 (best observed: 1.62, sacrifice: +0.00)
Model quality:
solver_time_sec: R² = 0.2763 (linear)
cache_miss_rate: R² = 0.2542 (linear)
Top 3 observed runs by overall desirability:
1. Run #4 (D=0.9498): hw_prefetcher=off, sw_prefetch_dist=512, l2_stream_detect=off, dcache_policy=write_back, prefetch_threads=2, tlb_prefetch=off
2. Run #10 (D=0.9376): hw_prefetcher=on, sw_prefetch_dist=64, l2_stream_detect=off, dcache_policy=write_through, prefetch_threads=2, tlb_prefetch=off
3. Run #9 (D=0.7155): hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_back, prefetch_threads=0, tlb_prefetch=off