Plackett-Burman

Hardware & Software Prefetch Tuning

Identify critical prefetch parameters for memory-bound CFD workloads on modern HPC processors.

Summary

This experiment uses a Plackett-Burman screening design to identify which hardware and software prefetch parameters matter most for memory-bound CFD workloads.

The design varies 6 factors: hw_prefetcher (off or on), sw_prefetch_dist (64 to 512 bytes), l2_stream_detect (off or on), dcache_policy (write_back or write_through), prefetch_threads (0 to 2), and tlb_prefetch (off or on). The goal is to minimize 2 responses: solver_time_sec (seconds) and cache_miss_rate (%). Fixed conditions held constant across all runs are processor = Xeon_8490H, workload = CFD_structured_grid, grid points = 256M, and iterations = 100.

A Plackett-Burman screening design was used to efficiently test 6 factors in only 16 runs. This design assumes interactions are negligible and focuses on identifying the most influential main effects.
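As a rough illustration of where the run count comes from: Plackett-Burman designs use run counts that are multiples of 4 and can screen up to N - 1 two-level factors in N runs. The sketch below computes the smallest such N for 6 factors. The helper name `pb_min_runs` is hypothetical and is not part of the `doe` tool; reading the 16 runs as an 8-run base design repeated in two blocks is an interpretation of the experimental matrix on this page, not something the tool states.

```python
import math


def pb_min_runs(n_factors: int) -> int:
    """Smallest multiple of 4 that is strictly greater than n_factors.

    Hypothetical helper for illustration only.
    """
    return 4 * math.ceil((n_factors + 1) / 4)


base_runs = pb_min_runs(6)   # smallest Plackett-Burman size for 6 factors
n_blocks = 2                 # assumption: the matrix repeats the base design in 2 blocks
total_runs = base_runs * n_blocks

print(f"base runs: {base_runs}, total runs with {n_blocks} blocks: {total_runs}")
```

With 6 factors the base design has 8 runs, so two blocks give the 16 runs listed in the matrix.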

Key Findings

For solver_time_sec, the most influential factors in the main-effects table were sw_prefetch_dist (24.6%), l2_stream_detect (21.5%), and prefetch_threads (16.4%). The best observed value was 50.73 seconds (at hw_prefetcher = on, sw_prefetch_dist = 512, l2_stream_detect = on).

For cache_miss_rate, the most influential factors in the main-effects table were tlb_prefetch (26.7%), hw_prefetcher (25.3%), and dcache_policy (23.7%). The best observed value was 1.62% (at hw_prefetcher = on, sw_prefetch_dist = 512, l2_stream_detect = on).

Recommended Next Steps

Experimental Setup

Factors

| Factor | Levels | Type | Unit |
| --- | --- | --- | --- |
| hw_prefetcher | off / on | categorical | |
| sw_prefetch_dist | 64 – 512 | continuous | bytes |
| l2_stream_detect | off / on | categorical | |
| dcache_policy | write_back / write_through | categorical | |
| prefetch_threads | 0 – 2 | continuous | |
| tlb_prefetch | off / on | categorical | |

Fixed: processor = Xeon 8490H, workload = CFD structured grid, grid_points = 256M, iterations = 100

Responses

| Response | Direction | Unit |
| --- | --- | --- |
| solver_time_sec | ↓ minimize | seconds |
| cache_miss_rate | ↓ minimize | % |

Experimental Matrix

The Plackett-Burman design produces 16 runs: 8 distinct factor combinations, each run once in block 1 and once in block 2. Each row is one experiment with specific factor settings.

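| Run | Block | hw_prefetcher | sw_prefetch_dist | l2_stream_detect | dcache_policy | prefetch_threads | tlb_prefetch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | on | 512 | on | write_back | 0 | off |
| 2 | 1 | off | 64 | on | write_through | 0 | off |
| 3 | 1 | off | 512 | off | write_through | 0 | on |
| 4 | 1 | on | 512 | on | write_through | 2 | on |
| 5 | 1 | off | 512 | off | write_back | 2 | off |
| 6 | 1 | on | 64 | off | write_through | 2 | off |
| 7 | 1 | off | 64 | on | write_back | 2 | on |
| 8 | 1 | on | 64 | off | write_back | 0 | on |
| 9 | 2 | on | 512 | on | write_back | 0 | off |
| 10 | 2 | on | 512 | on | write_through | 2 | on |
| 11 | 2 | off | 512 | off | write_back | 2 | off |
| 12 | 2 | off | 64 | on | write_back | 2 | on |
| 13 | 2 | off | 64 | on | write_through | 0 | off |
| 14 | 2 | off | 512 | off | write_through | 0 | on |
| 15 | 2 | on | 64 | off | write_through | 2 | off |
| 16 | 2 | on | 64 | off | write_back | 0 | on |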
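Because a screening design this small assumes interactions are negligible, it can help to see which two-factor interactions share a column with a main effect. The sketch below takes the 8 block-1 rows from the matrix, codes each factor as -1/+1, and compares every pairwise product column against the main-effect columns. The -1/+1 coding (second listed level = +1) and the helper name `find_aliases` are assumptions made for illustration; the `doe` tool may code levels differently.

```python
from itertools import combinations

FACTORS = ["hw_prefetcher", "sw_prefetch_dist", "l2_stream_detect",
           "dcache_policy", "prefetch_threads", "tlb_prefetch"]

# Assumed coding: these level labels map to +1, everything else to -1.
HIGH = {"on", "512", "write_through", "2"}

# Runs 1-8 (block 1) copied from the experimental matrix.
BLOCK_1 = """
on 512 on write_back 0 off
off 64 on write_through 0 off
off 512 off write_through 0 on
on 512 on write_through 2 on
off 512 off write_back 2 off
on 64 off write_through 2 off
off 64 on write_back 2 on
on 64 off write_back 0 on
"""

rows = [[1 if v in HIGH else -1 for v in line.split()]
        for line in BLOCK_1.strip().splitlines()]
cols = {name: tuple(r[i] for r in rows) for i, name in enumerate(FACTORS)}


def find_aliases():
    """Return (factor_a, factor_b, main_effect, sign) for each aliased pair."""
    found = []
    for a, b in combinations(FACTORS, 2):
        product = tuple(x * y for x, y in zip(cols[a], cols[b]))
        for m in FACTORS:
            if m in (a, b):
                continue
            if product == cols[m]:
                found.append((a, b, m, 1))
            elif product == tuple(-v for v in cols[m]):
                found.append((a, b, m, -1))
    return found


for a, b, m, sign in find_aliases():
    print(f"{a} x {b} is aliased with {'+' if sign > 0 else '-'}{m}")
```

Under this coding, 12 of the 15 two-factor interactions coincide with a main-effect column, so an estimated main effect cannot be separated from those interactions using this design alone.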

How to Run

terminal
$ doe info --config use_cases/22_prefetch_strategy/config.json
$ doe generate --config use_cases/22_prefetch_strategy/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/22_prefetch_strategy/config.json
$ doe optimize --config use_cases/22_prefetch_strategy/config.json
$ doe report --config use_cases/22_prefetch_strategy/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: solver_time_sec

Pareto Chart

Pareto chart for solver_time_sec

Main Effects Plot

Main effects plot for solver_time_sec

Response: cache_miss_rate

Pareto Chart

Pareto chart for cache_miss_rate

Main Effects Plot

Main effects plot for cache_miss_rate

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

solver_time: sw_prefetch_dist vs prefetch_threads

RSM surface: solver_time: sw_prefetch_dist vs prefetch_threads

cache_miss_rate: sw_prefetch_dist vs prefetch_threads

RSM surface: cache_miss_rate: sw_prefetch_dist vs prefetch_threads

Full Analysis Output

doe analyze
=== Main Effects: solver_time_sec ===
Factor               Effect      Std Error   % Contribution
--------------------------------------------------------------
sw_prefetch_dist     13.8375     4.4263      24.6%
l2_stream_detect     -12.1125    4.4263      21.5%
prefetch_threads     -9.2450     4.4263      16.4%
hw_prefetcher        -8.4075     4.4263      14.9%
tlb_prefetch         -7.6350     4.4263      13.6%
dcache_policy        -5.0750     4.4263      9.0%

=== Interaction Effects: solver_time_sec ===
Factor A             Factor B             Interaction   % Contribution
------------------------------------------------------------------------
hw_prefetcher        l2_stream_detect     -13.8375      9.5%
dcache_policy        tlb_prefetch         -13.8375      9.5%
hw_prefetcher        sw_prefetch_dist     12.1125       8.4%
prefetch_threads     tlb_prefetch         -12.1125      8.4%
hw_prefetcher        tlb_prefetch         -10.8100      7.5%
sw_prefetch_dist     prefetch_threads     10.8100       7.5%
l2_stream_detect     dcache_policy        -10.8100      7.5%
hw_prefetcher        dcache_policy        -9.2450       6.4%
l2_stream_detect     tlb_prefetch         -9.2450       6.4%
sw_prefetch_dist     l2_stream_detect     8.4075        5.8%
dcache_policy        prefetch_threads     -8.4075       5.8%
sw_prefetch_dist     dcache_policy        7.6350        5.3%
l2_stream_detect     prefetch_threads     -7.6350       5.3%
hw_prefetcher        prefetch_threads     -5.0750       3.5%
sw_prefetch_dist     tlb_prefetch         5.0750        3.5%

=== Summary Statistics: solver_time_sec ===

hw_prefetcher:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
off              8    88.7550    13.1396    69.7300    110.5000
on               8    80.3475    21.4169    50.7300    109.7600

sw_prefetch_dist:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
512              8    77.6325    17.8757    50.7300    102.8700
64               8    91.4700    15.5810    69.7300    110.5000

l2_stream_detect:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
off              8    90.6075    12.8466    72.7600    109.7600
on               8    78.4950    20.5634    50.7300    110.5000

dcache_policy:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
write_back       8    87.0888    10.9197    72.7600    109.7600
write_through    8    82.0138    23.1898    50.7300    110.5000

prefetch_threads:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
0                8    89.1737    16.2978    69.7300    110.5000
2                8    79.9287    18.9014    50.7300    101.7600

tlb_prefetch:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
off              8    88.3688    12.5660    69.7300    110.5000
on               8    80.7338    21.9205    50.7300    109.7600

=== Main Effects: cache_miss_rate ===
Factor               Effect      Std Error   % Contribution
--------------------------------------------------------------
tlb_prefetch         -2.1975     0.7905      26.7%
hw_prefetcher        -2.0825     0.7905      25.3%
dcache_policy        -1.9475     0.7905      23.7%
l2_stream_detect     -1.1025     0.7905      13.4%
sw_prefetch_dist     0.4825      0.7905      5.9%
prefetch_threads     -0.4125     0.7905      5.0%

=== Interaction Effects: cache_miss_rate ===
Factor A             Factor B             Interaction   % Contribution
------------------------------------------------------------------------
hw_prefetcher        tlb_prefetch         -3.0675       12.0%
sw_prefetch_dist     prefetch_threads     3.0675        12.0%
l2_stream_detect     dcache_policy        -3.0675       12.0%
sw_prefetch_dist     dcache_policy        2.1975        8.6%
l2_stream_detect     prefetch_threads     -2.1975       8.6%
sw_prefetch_dist     l2_stream_detect     2.0825        8.1%
dcache_policy        prefetch_threads     -2.0825       8.1%
hw_prefetcher        prefetch_threads     -1.9475       7.6%
sw_prefetch_dist     tlb_prefetch         1.9475        7.6%
hw_prefetcher        sw_prefetch_dist     1.1025        4.3%
prefetch_threads     tlb_prefetch         -1.1025       4.3%
hw_prefetcher        l2_stream_detect     -0.4825       1.9%
dcache_policy        tlb_prefetch         -0.4825       1.9%
hw_prefetcher        dcache_policy        -0.4125       1.6%
l2_stream_detect     tlb_prefetch         -0.4125       1.6%

=== Summary Statistics: cache_miss_rate ===

hw_prefetcher:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
off              8    8.5125     2.2260     3.4600     10.5400
on               8    6.4300     3.7405     1.6200     10.3200

sw_prefetch_dist:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
512              8    7.2300     3.3916     1.6200     10.3200
64               8    7.7125     3.1287     2.2700     10.5400

l2_stream_detect:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
off              8    8.0225     2.3541     2.2700     9.2700
on               8    6.9200     3.8973     1.6200     10.5400

dcache_policy:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
write_back       8    8.4450     2.6124     2.2700     10.5400
write_through    8    6.4975     3.5260     1.6200     10.5400

prefetch_threads:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
0                8    7.6775     3.0921     2.2700     10.5400
2                8    7.2650     3.4302     1.6200     10.5400

tlb_prefetch:
Level            N    Mean       Std        Min        Max
------------------------------------------------------------
off              8    8.5700     2.2039     3.4600     10.5400
on               8    6.3725     3.7159     1.6200     10.5400
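The main-effect and contribution columns for solver_time_sec can be reproduced from the per-level means in the summary statistics. The sketch below assumes two conventions that are inferred from the printed numbers rather than documented: the effect is the mean at the level that sorts last as a string minus the mean at the level that sorts first (so "64" counts as the high level of sw_prefetch_dist), and % contribution is each absolute effect divided by the sum of absolute effects. The function names are illustrative.

```python
# Per-level means of solver_time_sec, copied from the summary statistics.
level_means = {
    "hw_prefetcher":    {"off": 88.7550, "on": 80.3475},
    "sw_prefetch_dist": {"512": 77.6325, "64": 91.4700},
    "l2_stream_detect": {"off": 90.6075, "on": 78.4950},
    "dcache_policy":    {"write_back": 87.0888, "write_through": 82.0138},
    "prefetch_threads": {"0": 89.1737, "2": 79.9287},
    "tlb_prefetch":     {"off": 88.3688, "on": 80.7338},
}


def main_effect(means: dict) -> float:
    """Mean at the last-sorted level minus mean at the first-sorted level.

    Assumption: levels are ordered by string sort, so "512" < "64".
    """
    low, high = sorted(means)
    return means[high] - means[low]


effects = {name: main_effect(m) for name, m in level_means.items()}
total = sum(abs(e) for e in effects.values())
contribution = {name: 100 * abs(e) / total for name, e in effects.items()}

for name in sorted(effects, key=lambda n: -abs(effects[n])):
    print(f"{name:18s} {effects[name]:9.4f} {contribution[name]:5.1f}%")
```

Under these assumptions, the positive sw_prefetch_dist effect means solver time is higher at 64 bytes than at 512 bytes, not that a longer prefetch distance slows the solver.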

Optimization Recommendations

doe optimize
=== Optimization: solver_time_sec ===
Direction: minimize
Best observed run: #10
  hw_prefetcher = off
  sw_prefetch_dist = 512
  l2_stream_detect = off
  dcache_policy = write_through
  prefetch_threads = 0
  tlb_prefetch = on
  Value: 50.73

Factor importance:
  1. dcache_policy (effect: -12.0, contribution: 27.1%)
  2. prefetch_threads (effect: -11.0, contribution: 24.8%)
  3. l2_stream_detect (effect: -8.8, contribution: 19.8%)
  4. sw_prefetch_dist (effect: 8.3, contribution: 18.6%)
  5. tlb_prefetch (effect: -2.5, contribution: 5.6%)
  6. hw_prefetcher (effect: 1.8, contribution: 4.1%)

=== Optimization: cache_miss_rate ===
Direction: minimize
Best observed run: #4
  hw_prefetcher = on
  sw_prefetch_dist = 512
  l2_stream_detect = on
  dcache_policy = write_through
  prefetch_threads = 2
  tlb_prefetch = on
  Value: 1.62

Factor importance:
  1. dcache_policy (effect: -2.5, contribution: 38.1%)
  2. prefetch_threads (effect: -1.9, contribution: 28.6%)
  3. tlb_prefetch (effect: -1.2, contribution: 19.1%)
  4. sw_prefetch_dist (effect: 0.7, contribution: 11.1%)
  5. l2_stream_detect (effect: 0.1, contribution: 2.0%)
  6. hw_prefetcher (effect: 0.1, contribution: 1.2%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
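The combination step can be sketched as a weighted geometric mean of the per-response desirabilities, using the weights and desirability values reported on this page. How the tool scales each response to its 0–1 desirability is not shown here, so the sketch covers only the combination; the function name `overall_desirability` is illustrative.

```python
import math


def overall_desirability(d: dict, w: dict) -> float:
    """Weighted geometric mean: D = (prod d_i ** w_i) ** (1 / sum w_i)."""
    if any(value <= 0 for value in d.values()):
        return 0.0  # one fully undesirable response zeroes the whole score
    total_weight = sum(w[name] for name in d)
    log_sum = sum(w[name] * math.log(d[name]) for name in d)
    return math.exp(log_sum / total_weight)


# Values taken from the per-response desirability table on this page.
desirability = {"solver_time_sec": 0.9466, "cache_miss_rate": 0.9545}
weights = {"solver_time_sec": 1.5, "cache_miss_rate": 1.0}

D = overall_desirability(desirability, weights)
print(f"D = {D:.4f}")
```

With these inputs the result rounds to the reported overall desirability of 0.9498, which suggests the weights are normalized by their sum.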

Overall Desirability
D = 0.9498

Per-Response Desirability

| Response | Weight | Desirability | Predicted | Dir |
| --- | --- | --- | --- | --- |
| solver_time_sec | 1.5 | 0.9466 | 51.25 seconds | ↓ |
| cache_miss_rate | 1.0 | 0.9545 | 1.62 % | ↓ |

Recommended Settings

| Factor | Value |
| --- | --- |
| hw_prefetcher | off |
| sw_prefetch_dist | 512 bytes |
| l2_stream_detect | off |
| dcache_policy | write_back |
| prefetch_threads | 2 |
| tlb_prefetch | off |

Source: from observed run #4

Trade-off Summary

Sacrifice = how much worse than single-objective best.
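As a small worked example of this definition, the sketch below computes the sacrifice for both responses from the predicted and best-observed values in the multi-objective output. The sign handling for a maximize direction is an assumption added for completeness; both responses here are minimized.

```python
def sacrifice(predicted: float, best_observed: float,
              direction: str = "minimize") -> float:
    """How much worse the compromise is than the single-objective best."""
    diff = predicted - best_observed
    # Assumption: for a maximized response, "worse" means lower.
    return diff if direction == "minimize" else -diff


# (predicted, best observed) pairs from the multi-objective output.
results = {
    "solver_time_sec": (51.25, 50.73),
    "cache_miss_rate": (1.62, 1.62),
}

for name, (predicted, best) in results.items():
    print(f"{name}: sacrifice = {sacrifice(predicted, best):+.2f}")
```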

| Response | Predicted | Best Observed | Sacrifice |
| --- | --- | --- | --- |
| solver_time_sec | 51.25 | 50.73 | +0.52 |
| cache_miss_rate | 1.62 | 1.62 | +0.00 |

Top 3 Runs by Desirability

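| Run | D | Factor Settings |
| --- | --- | --- |
| #4 | 0.9498 | hw_prefetcher=off, sw_prefetch_dist=512, l2_stream_detect=off, dcache_policy=write_back, prefetch_threads=2, tlb_prefetch=off |
| #10 | 0.9376 | hw_prefetcher=on, sw_prefetch_dist=64, l2_stream_detect=off, dcache_policy=write_through, prefetch_threads=2, tlb_prefetch=off |
| #9 | 0.7155 | hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_back, prefetch_threads=0, tlb_prefetch=off |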

Model Quality

| Response | R² | Type |
| --- | --- | --- |
| solver_time_sec | 0.2763 | linear |
| cache_miss_rate | 0.2542 | linear |

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9498

Response             Weight   Desirability   Predicted        Direction
---------------------------------------------------------------------
solver_time_sec      1.5      0.9466         51.25 seconds    ↓
cache_miss_rate      1.0      0.9545         1.62 %           ↓

Recommended settings:
  hw_prefetcher = off
  sw_prefetch_dist = 512 bytes
  l2_stream_detect = off
  dcache_policy = write_back
  prefetch_threads = 2
  tlb_prefetch = off
  (from observed run #4)

Trade-off summary:
  solver_time_sec: 51.25 (best observed: 50.73, sacrifice: +0.52)
  cache_miss_rate: 1.62 (best observed: 1.62, sacrifice: +0.00)

Model quality:
  solver_time_sec: R² = 0.2763 (linear)
  cache_miss_rate: R² = 0.2542 (linear)

Top 3 observed runs by overall desirability:
  1. Run #4 (D=0.9498): hw_prefetcher=off, sw_prefetch_dist=512, l2_stream_detect=off, dcache_policy=write_back, prefetch_threads=2, tlb_prefetch=off
  2. Run #10 (D=0.9376): hw_prefetcher=on, sw_prefetch_dist=64, l2_stream_detect=off, dcache_policy=write_through, prefetch_threads=2, tlb_prefetch=off
  3. Run #9 (D=0.7155): hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_back, prefetch_threads=0, tlb_prefetch=off