Hardware & Software Prefetch Tuning

Summary

This experiment uses a Plackett-Burman screening design to identify the prefetch parameters that matter most for memory-bound CFD workloads.

The design varies 6 factors: hw prefetcher (off or on), sw prefetch dist (64 to 512 bytes), l2 stream detect (off or on), dcache policy (write_back or write_through), prefetch threads (0 to 2), and tlb prefetch (off or on). The goal is to minimize 2 responses: solver time sec (seconds) and cache miss rate (%). Conditions held constant across all runs: processor = Xeon_8490H, workload = CFD_structured_grid, grid points = 256M, iterations = 100.

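A Plackett-Burman screening design was used to test 6 factors in 16 runs: 8 distinct factor settings, each run once in each of 2 blocks. The design assumes interactions are negligible and focuses on identifying the most influential main effects. With only 8 distinct settings, two-factor interactions are aliased with main effects, so the interaction estimates in the analysis output cannot be separated from them.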
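As an illustration of how such a design can be built, the sketch below uses the textbook cyclic construction for the 8-run Plackett-Burman design. It is an assumption that this resembles what the `doe` tool does internally; the function name `plackett_burman_8` is hypothetical, and the matrix it prints is not claimed to match the report's matrix row for row.

```python
# Textbook first row for the 8-run Plackett-Burman design (up to 7 factors).
GENERATOR = [1, 1, 1, -1, 1, -1, -1]


def plackett_burman_8(n_factors):
    """Return an 8-run design as rows coded -1 (low level) / +1 (high level)."""
    if not 1 <= n_factors <= 7:
        raise ValueError("the 8-run design supports 1 to 7 factors")
    size = len(GENERATOR)
    # Rows 1-7 are cyclic shifts of the generator row; row 8 is all low.
    rows = [[GENERATOR[(col - shift) % size] for col in range(size)]
            for shift in range(size)]
    rows.append([-1] * size)
    # Keep the first n_factors columns; the unused columns are dropped.
    return [row[:n_factors] for row in rows]


design = plackett_burman_8(6)
for row in design:
    print(" ".join("+" if value > 0 else "-" for value in row))
```

Every column of the result is balanced (four runs at each level) and orthogonal to every other column, which is what lets 8 runs estimate up to 7 main effects independently of one another.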

Key Findings

For solver time sec, the most influential factors were sw prefetch dist (24.6%), l2 stream detect (21.5%), and prefetch threads (16.4%). The best observed value was 50.73 (at hw prefetcher = on, sw prefetch dist = 512, l2 stream detect = on).

For cache miss rate, the most influential factors were tlb prefetch (26.7%), hw prefetcher (25.3%), and dcache policy (23.7%). The best observed value was 1.62 (at hw prefetcher = on, sw prefetch dist = 512, l2 stream detect = on).

Recommended Next Steps

Follow up with a response surface design (CCD or Box-Behnken) on the top 3–4 factors to model curvature and find the true optimum.
Consider whether any fixed factors should be varied in a future study.
The screening results can guide factor reduction — drop factors contributing less than 5% and re-run with a smaller, more focused design.

Experimental Setup

Factors

Factor	Levels	Type	Unit
hw_prefetcher	off / on	categorical	—
sw_prefetch_dist	64 – 512	continuous	bytes
l2_stream_detect	off / on	categorical	—
dcache_policy	write_back / write_through	categorical	—
prefetch_threads	0 – 2	continuous	—
tlb_prefetch	off / on	categorical	—

Fixed: processor = Xeon 8490H, workload = CFD structured grid, grid_points = 256M, iterations = 100

Responses

Response	Direction	Unit
solver_time_sec	↓ minimize	seconds
cache_miss_rate	↓ minimize	%

Experimental Matrix

The Plackett-Burman design produces 16 runs: 8 distinct factor settings, each run once in Block 1 and once in Block 2. Each row is one experiment with specific factor settings.

Run	Block	`hw_prefetcher`	`sw_prefetch_dist`	`l2_stream_detect`	`dcache_policy`	`prefetch_threads`	`tlb_prefetch`
1	1	on	512	on	write_back	0	off
2	1	off	64	on	write_through	0	off
3	1	off	512	off	write_through	0	on
4	1	on	512	on	write_through	2	on
5	1	off	512	off	write_back	2	off
6	1	on	64	off	write_through	2	off
7	1	off	64	on	write_back	2	on
8	1	on	64	off	write_back	0	on
9	2	on	512	on	write_back	0	off
10	2	on	512	on	write_through	2	on
11	2	off	512	off	write_back	2	off
12	2	off	64	on	write_back	2	on
13	2	off	64	on	write_through	0	off
14	2	off	512	off	write_through	0	on
15	2	on	64	off	write_through	2	off
16	2	on	64	off	write_back	0	on
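The structure of the matrix above can be checked directly. The sketch below copies the eight Block 1 rows from the table, codes each level as -1 or +1 (the choice of which level counts as "high" is an assumption made for this illustration), and tests two properties: that every factor is balanced, and that the `sw_prefetch_dist` column coincides with the product of two other columns, which is what aliasing between a main effect and a two-factor interaction looks like in the coded matrix.

```python
# Block 1 of the experimental matrix (runs 1-8), columns in table order:
# hw_prefetcher, sw_prefetch_dist, l2_stream_detect,
# dcache_policy, prefetch_threads, tlb_prefetch
BLOCK_1 = [
    ("on", 512, "on", "write_back", 0, "off"),
    ("off", 64, "on", "write_through", 0, "off"),
    ("off", 512, "off", "write_through", 0, "on"),
    ("on", 512, "on", "write_through", 2, "on"),
    ("off", 512, "off", "write_back", 2, "off"),
    ("on", 64, "off", "write_through", 2, "off"),
    ("off", 64, "on", "write_back", 2, "on"),
    ("on", 64, "off", "write_back", 0, "on"),
]

# Assumed coding for this sketch: these levels are +1, all others are -1.
HIGH_LEVELS = {"on", 512, "write_through", 2}


def encode(row):
    """Map one row of factor levels to -1/+1 codes."""
    return [1 if level in HIGH_LEVELS else -1 for level in row]


coded = [encode(row) for row in BLOCK_1]

# Balance: every factor sits at each level in exactly half of the runs.
balanced = all(sum(column) == 0 for column in zip(*coded))

# Aliasing: the sw_prefetch_dist column equals the elementwise product of
# two other columns, so that main effect and the interaction share a contrast.
sw_equals_hw_x_l2 = all(r[1] == r[0] * r[2] for r in coded)
sw_equals_dcache_x_tlb = all(r[1] == r[3] * r[5] for r in coded)

print(balanced, sw_equals_hw_x_l2, sw_equals_dcache_x_tlb)
```

All three checks print `True` for this matrix. The two aliasing checks explain why the hw_prefetcher × l2_stream_detect and dcache_policy × tlb_prefetch interactions in the analysis output have the same magnitude as the sw_prefetch_dist main effect.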

How to Run

terminal
$ doe info --config use_cases/22_prefetch_strategy/config.json
$ doe generate --config use_cases/22_prefetch_strategy/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/22_prefetch_strategy/config.json
$ doe optimize --config use_cases/22_prefetch_strategy/config.json
$ doe report --config use_cases/22_prefetch_strategy/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: solver_time_sec

Pareto Chart

Main Effects Plot

Response: cache_miss_rate

Pareto Chart

Main Effects Plot

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

solver_time_sec: sw_prefetch_dist vs prefetch_threads

cache_miss_rate: sw_prefetch_dist vs prefetch_threads

Full Analysis Output

doe analyze
=== Main Effects: solver_time_sec ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
sw_prefetch_dist        13.8375       4.4263            24.6%
l2_stream_detect       -12.1125       4.4263            21.5%
prefetch_threads        -9.2450       4.4263            16.4%
hw_prefetcher           -8.4075       4.4263            14.9%
tlb_prefetch            -7.6350       4.4263            13.6%
dcache_policy           -5.0750       4.4263             9.0%

=== Interaction Effects: solver_time_sec ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
hw_prefetcher        l2_stream_detect         -13.8375             9.5%
dcache_policy        tlb_prefetch             -13.8375             9.5%
hw_prefetcher        sw_prefetch_dist          12.1125             8.4%
prefetch_threads     tlb_prefetch             -12.1125             8.4%
hw_prefetcher        tlb_prefetch             -10.8100             7.5%
sw_prefetch_dist     prefetch_threads          10.8100             7.5%
l2_stream_detect     dcache_policy            -10.8100             7.5%
hw_prefetcher        dcache_policy             -9.2450             6.4%
l2_stream_detect     tlb_prefetch              -9.2450             6.4%
sw_prefetch_dist     l2_stream_detect           8.4075             5.8%
dcache_policy        prefetch_threads          -8.4075             5.8%
sw_prefetch_dist     dcache_policy              7.6350             5.3%
l2_stream_detect     prefetch_threads          -7.6350             5.3%
hw_prefetcher        prefetch_threads          -5.0750             3.5%
sw_prefetch_dist     tlb_prefetch               5.0750             3.5%

=== Summary Statistics: solver_time_sec ===

hw_prefetcher:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    88.7550    13.1396    69.7300   110.5000
  on                  8    80.3475    21.4169    50.7300   109.7600

sw_prefetch_dist:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  512                 8    77.6325    17.8757    50.7300   102.8700
  64                  8    91.4700    15.5810    69.7300   110.5000

l2_stream_detect:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    90.6075    12.8466    72.7600   109.7600
  on                  8    78.4950    20.5634    50.7300   110.5000

dcache_policy:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  write_back          8    87.0888    10.9197    72.7600   109.7600
  write_through       8    82.0138    23.1898    50.7300   110.5000

prefetch_threads:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  0                   8    89.1737    16.2978    69.7300   110.5000
  2                   8    79.9287    18.9014    50.7300   101.7600

tlb_prefetch:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8    88.3688    12.5660    69.7300   110.5000
  on                  8    80.7338    21.9205    50.7300   109.7600

=== Main Effects: cache_miss_rate ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
tlb_prefetch            -2.1975       0.7905            26.7%
hw_prefetcher           -2.0825       0.7905            25.3%
dcache_policy           -1.9475       0.7905            23.7%
l2_stream_detect        -1.1025       0.7905            13.4%
sw_prefetch_dist         0.4825       0.7905             5.9%
prefetch_threads        -0.4125       0.7905             5.0%

=== Interaction Effects: cache_miss_rate ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
hw_prefetcher        tlb_prefetch              -3.0675            12.0%
sw_prefetch_dist     prefetch_threads           3.0675            12.0%
l2_stream_detect     dcache_policy             -3.0675            12.0%
sw_prefetch_dist     dcache_policy              2.1975             8.6%
l2_stream_detect     prefetch_threads          -2.1975             8.6%
sw_prefetch_dist     l2_stream_detect           2.0825             8.1%
dcache_policy        prefetch_threads          -2.0825             8.1%
hw_prefetcher        prefetch_threads          -1.9475             7.6%
sw_prefetch_dist     tlb_prefetch               1.9475             7.6%
hw_prefetcher        sw_prefetch_dist           1.1025             4.3%
prefetch_threads     tlb_prefetch              -1.1025             4.3%
hw_prefetcher        l2_stream_detect          -0.4825             1.9%
dcache_policy        tlb_prefetch              -0.4825             1.9%
hw_prefetcher        dcache_policy             -0.4125             1.6%
l2_stream_detect     tlb_prefetch              -0.4125             1.6%

=== Summary Statistics: cache_miss_rate ===

hw_prefetcher:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8     8.5125     2.2260     3.4600    10.5400
  on                  8     6.4300     3.7405     1.6200    10.3200

sw_prefetch_dist:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  512                 8     7.2300     3.3916     1.6200    10.3200
  64                  8     7.7125     3.1287     2.2700    10.5400

l2_stream_detect:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8     8.0225     2.3541     2.2700     9.2700
  on                  8     6.9200     3.8973     1.6200    10.5400

dcache_policy:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  write_back          8     8.4450     2.6124     2.2700    10.5400
  write_through       8     6.4975     3.5260     1.6200    10.5400

prefetch_threads:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  0                   8     7.6775     3.0921     2.2700    10.5400
  2                   8     7.2650     3.4302     1.6200    10.5400

tlb_prefetch:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8     8.5700     2.2039     3.4600    10.5400
  on                  8     6.3725     3.7159     1.6200    10.5400
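The main effects and percent contributions in the output above can be reproduced from the level means in the summary statistics. The sketch below does this for solver_time_sec, assuming each effect is the mean at the second listed level minus the mean at the first, and each contribution is the absolute effect divided by the sum of absolute effects. The reported numbers are consistent with that reading, but this is an observation about the output, not a statement about how `doe analyze` is implemented.

```python
# Level means for solver_time_sec, copied from the summary statistics.
# Each pair is (mean at first listed level, mean at second listed level).
LEVEL_MEANS = {
    "hw_prefetcher": (88.7550, 80.3475),      # off, on
    "sw_prefetch_dist": (77.6325, 91.4700),   # 512, 64 (listed in that order)
    "l2_stream_detect": (90.6075, 78.4950),   # off, on
    "dcache_policy": (87.0888, 82.0138),      # write_back, write_through
    "prefetch_threads": (89.1737, 79.9287),   # 0, 2
    "tlb_prefetch": (88.3688, 80.7338),       # off, on
}


def main_effects(level_means):
    """Effect = mean at the second listed level minus mean at the first."""
    return {name: second - first
            for name, (first, second) in level_means.items()}


def percent_contribution(effects):
    """Share of each absolute effect in the sum of absolute effects."""
    total = sum(abs(effect) for effect in effects.values())
    return {name: 100.0 * abs(effect) / total
            for name, effect in effects.items()}


effects = main_effects(LEVEL_MEANS)
shares = percent_contribution(effects)
for name in sorted(shares, key=shares.get, reverse=True):
    print(f"{name:<18} {effects[name]:>9.4f} {shares[name]:>6.1f}%")
```

Because the sw_prefetch_dist levels are listed as 512 and then 64, its positive effect of 13.8375 means the 64-byte setting was slower on average than the 512-byte setting, not the other way around.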

Optimization Recommendations

doe optimize
=== Optimization: solver_time_sec ===
Direction: minimize

Best observed run: #10
  hw_prefetcher = on
  sw_prefetch_dist = 512
  l2_stream_detect = on
  dcache_policy = write_through
  prefetch_threads = 2
  tlb_prefetch = on
  Value: 50.73

Factor importance:
  1. dcache_policy  (effect: -12.0, contribution: 27.1%)
  2. prefetch_threads  (effect: -11.0, contribution: 24.8%)
  3. l2_stream_detect  (effect: -8.8, contribution: 19.8%)
  4. sw_prefetch_dist  (effect: 8.3, contribution: 18.6%)
  5. tlb_prefetch  (effect: -2.5, contribution: 5.6%)
  6. hw_prefetcher  (effect: 1.8, contribution: 4.1%)

=== Optimization: cache_miss_rate ===
Direction: minimize

Best observed run: #4
  hw_prefetcher = on
  sw_prefetch_dist = 512
  l2_stream_detect = on
  dcache_policy = write_through
  prefetch_threads = 2
  tlb_prefetch = on
  Value: 1.62

Factor importance:
  1. dcache_policy  (effect: -2.5, contribution: 38.1%)
  2. prefetch_threads  (effect: -1.9, contribution: 28.6%)
  3. tlb_prefetch  (effect: -1.2, contribution: 19.1%)
  4. sw_prefetch_dist  (effect: 0.7, contribution: 11.1%)
  5. l2_stream_detect  (effect: 0.1, contribution: 2.0%)
  6. hw_prefetcher  (effect: 0.1, contribution: 1.2%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
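The combination step can be sketched as a weighted geometric mean. The function below is a minimal illustration, with `overall_desirability` a hypothetical name; the weights (1.5 and 1.0) and per-response desirabilities (0.9466 and 0.9545) are the ones this report gives. How the tool scales each response to its 0–1 desirability (the bounds it uses) is not stated in the report, so that step is left out.

```python
import math


def overall_desirability(desirabilities, weights):
    """Weighted geometric mean of per-response desirabilities in [0, 1]."""
    if len(desirabilities) != len(weights):
        raise ValueError("need exactly one weight per response")
    # A single fully undesirable response makes the whole setting undesirable.
    if any(d <= 0.0 for d in desirabilities):
        return 0.0
    total_weight = sum(weights)
    log_sum = sum(w * math.log(d) for d, w in zip(desirabilities, weights))
    return math.exp(log_sum / total_weight)


# solver_time_sec (weight 1.5) and cache_miss_rate (weight 1.0)
D = overall_desirability([0.9466, 0.9545], [1.5, 1.0])
print(f"D = {D:.4f}")
```

With these inputs the result agrees with the reported overall desirability of 0.9498 to four decimal places, which suggests the tool combines responses this way.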

Overall Desirability

D = 0.9498

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`solver_time_sec`	1.5	0.9466	51.25 seconds	↓
`cache_miss_rate`	1.0	0.9545	1.62 %	↓

Recommended Settings

Factor	Value
`hw_prefetcher`	on
`sw_prefetch_dist`	512 bytes
`l2_stream_detect`	on
`dcache_policy`	write_through
`prefetch_threads`	2
`tlb_prefetch`	on

Source: from observed run #4

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response	Predicted	Best Observed	Sacrifice
`solver_time_sec`	51.25	50.73	+0.52
`cache_miss_rate`	1.62	1.62	+0.00

Top 3 Runs by Desirability

Run	D	Factor Settings
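#4	0.9498	hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_through, prefetch_threads=2, tlb_prefetch=on
#10	0.9376	hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_through, prefetch_threads=2, tlb_prefetch=on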
#9	0.7155	hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_back, prefetch_threads=0, tlb_prefetch=off

Model Quality

Response	R²	Type
`solver_time_sec`	0.2763	linear
`cache_miss_rate`	0.2542	linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9498

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
solver_time_sec              1.5       0.9466       51.25 seconds   ↓
cache_miss_rate              1.0       0.9545        1.62 %   ↓

Recommended settings:
  hw_prefetcher = on
  sw_prefetch_dist = 512 bytes
  l2_stream_detect = on
  dcache_policy = write_through
  prefetch_threads = 2
  tlb_prefetch = on
  (from observed run #4)

Trade-off summary:
  solver_time_sec: 51.25 (best observed: 50.73, sacrifice: +0.52)
  cache_miss_rate: 1.62 (best observed: 1.62, sacrifice: +0.00)

Model quality:
  solver_time_sec: R² = 0.2763 (linear)
  cache_miss_rate: R² = 0.2542 (linear)

Top 3 observed runs by overall desirability:
  1. Run #4 (D=0.9498): hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_through, prefetch_threads=2, tlb_prefetch=on
  2. Run #10 (D=0.9376): hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_through, prefetch_threads=2, tlb_prefetch=on
  3. Run #9 (D=0.7155): hw_prefetcher=on, sw_prefetch_dist=512, l2_stream_detect=on, dcache_policy=write_back, prefetch_threads=0, tlb_prefetch=off