Full Factorial

CPU Cross-NUMA Bandwidth

Characterize CPU-to-CPU memory bandwidth and latency across NUMA domains on a 4-socket HPC node.

Summary

This experiment uses a full factorial design to characterize CPU-to-CPU memory bandwidth and latency across NUMA domains on a 4-socket HPC node.

The design varies four factors: NUMA hop (local, near, or far), transfer mode (memcpy or streaming_store), thread count (1 or 14 threads), and buffer size (1048576 or 268435456 bytes, i.e. 1 MiB or 256 MiB). Two responses are optimized: bandwidth_GBs (GB/s, maximize) and latency_ns (ns, minimize). Fixed conditions held constant across all runs are sockets = 4, cores per socket = 28, numa distance near = 21, and numa distance far = 31.

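A full factorial design was used to explore all 24 possible combinations of the 4 factors: 3 levels of numa_hop times 2 levels each of transfer_mode, thread_count, and buffer_size (3 × 2 × 2 × 2 = 24 runs). This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment than a fractional design would need.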
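The 24 runs are the Cartesian product of the factor levels. As a minimal sketch (the variable names are illustrative, and this is not necessarily how the doe tool builds its matrix), the combinations can be enumerated with Python's itertools.product:

```python
from itertools import product

# Factor levels as listed in the Factors table of this report.
factors = {
    "numa_hop": ["local", "near", "far"],
    "transfer_mode": ["memcpy", "streaming_store"],
    "thread_count": [1, 14],
    "buffer_size": [1048576, 268435456],
}

# Cartesian product of all levels: one dict per experimental run.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(runs))  # 3 * 2 * 2 * 2 combinations
print(runs[0])
```

The enumeration order here is systematic; the experimental matrix in this report lists the same combinations in a shuffled order.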

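Response surface models were fitted to show how pairs of factors jointly affect each response; the RSM surface plots below visualize these fits. Because thread_count and buffer_size were each run at only two levels, curvature cannot be estimated from this design, and the models reported by doe optimize are linear with low R² (below 0.35 for both responses), so the surfaces indicate coarse trends rather than precise predictions.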

Key Findings

For bandwidth_GBs, doe analyze ranks the main effects as numa_hop (43.3%), buffer_size (38.5%), thread_count (10.9%), and transfer_mode (7.3%). The best observed value was 316.2 GB/s (at numa_hop = local, transfer_mode = memcpy, thread_count = 14, buffer_size = 268435456).

For latency_ns, the ranking is numa_hop (78.9%), buffer_size (11.5%), thread_count (5.2%), and transfer_mode (4.4%). The best observed value was 61.5 ns (at numa_hop = near, transfer_mode = streaming_store, thread_count = 1, buffer_size = 1048576).

Experimental Setup

Factors

| Factor | Levels | Type | Unit |
|---|---|---|---|
| numa_hop | local, near, far | categorical | |
| transfer_mode | memcpy, streaming_store | categorical | |
| thread_count | 1, 14 | continuous | threads |
| buffer_size | 1048576, 268435456 | continuous | bytes |

Fixed: sockets = 4, cores_per_socket = 28, numa_distance_near = 21, numa_distance_far = 31

Responses

| Response | Direction | Unit |
|---|---|---|
| bandwidth_GBs | ↑ maximize | GB/s |
| latency_ns | ↓ minimize | ns |

Experimental Matrix

The Full Factorial Design produces 24 runs. Each row is one experiment with specific factor settings.

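| Run | numa_hop | transfer_mode | thread_count | buffer_size |
|---|---|---|---|---|
| 1 | near | streaming_store | 14 | 268435456 |
| 2 | local | streaming_store | 1 | 268435456 |
| 3 | far | memcpy | 14 | 1048576 |
| 4 | far | streaming_store | 14 | 268435456 |
| 5 | near | streaming_store | 1 | 268435456 |
| 6 | near | memcpy | 14 | 1048576 |
| 7 | far | memcpy | 1 | 268435456 |
| 8 | near | memcpy | 14 | 268435456 |
| 9 | near | streaming_store | 14 | 1048576 |
| 10 | far | memcpy | 1 | 1048576 |
| 11 | local | memcpy | 1 | 268435456 |
| 12 | near | streaming_store | 1 | 1048576 |
| 13 | far | streaming_store | 1 | 268435456 |
| 14 | local | streaming_store | 14 | 1048576 |
| 15 | near | memcpy | 1 | 268435456 |
| 16 | local | memcpy | 14 | 1048576 |
| 17 | far | streaming_store | 14 | 1048576 |
| 18 | local | streaming_store | 1 | 1048576 |
| 19 | far | memcpy | 14 | 268435456 |
| 20 | local | streaming_store | 14 | 268435456 |
| 21 | near | memcpy | 1 | 1048576 |
| 22 | local | memcpy | 1 | 1048576 |
| 23 | local | memcpy | 14 | 268435456 |
| 24 | far | streaming_store | 1 | 1048576 |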

How to Run

terminal
$ doe info --config use_cases/17_cpu_cross_numa_bandwidth/config.json
$ doe generate --config use_cases/17_cpu_cross_numa_bandwidth/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/17_cpu_cross_numa_bandwidth/config.json
$ doe optimize --config use_cases/17_cpu_cross_numa_bandwidth/config.json
$ doe optimize --config use_cases/17_cpu_cross_numa_bandwidth/config.json --multi # multi-objective
$ doe report --config use_cases/17_cpu_cross_numa_bandwidth/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: bandwidth_GBs

Pareto Chart

Pareto chart for bandwidth_GBs

Main Effects Plot

Main effects plot for bandwidth_GBs

Response: latency_ns

Pareto Chart

Pareto chart for latency_ns

Main Effects Plot

Main effects plot for latency_ns

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

bandwidth_GBs: thread_count vs buffer_size

RSM surface: bandwidth_GBs: thread_count vs buffer_size

latency_ns: thread_count vs buffer_size

RSM surface: latency_ns: thread_count vs buffer_size

Full Analysis Output

doe analyze
=== Main Effects: bandwidth_GBs ===
Factor           Effect      Std Error   % Contribution
--------------------------------------------------------------
numa_hop          66.7375    20.9093     43.3%
buffer_size      -59.2917    20.9093     38.5%
thread_count     -16.8250    20.9093     10.9%
transfer_mode     11.2250    20.9093      7.3%

=== Interaction Effects: bandwidth_GBs ===
Factor A         Factor B         Interaction   % Contribution
------------------------------------------------------------------------
transfer_mode    buffer_size      53.1583       67.9%
transfer_mode    thread_count     13.2583       16.9%
thread_count     buffer_size      11.9083       15.2%

=== Summary Statistics: bandwidth_GBs ===

numa_hop:
Level             N    Mean       Std        Min       Max
------------------------------------------------------------
far               8    162.7625    94.4747   19.8000   297.6000
local             8    112.4250   110.3910   15.8000   316.2000
near              8     96.0250   102.9568   30.8000   271.9000

transfer_mode:
Level             N    Mean       Std        Min       Max
------------------------------------------------------------
memcpy           12    118.1250   111.6678   15.8000   316.2000
streaming_store  12    129.3500    96.9587   19.8000   297.6000

thread_count:
Level             N    Mean       Std        Min       Max
------------------------------------------------------------
1                12    132.1500    97.0430   29.9000   316.2000
14               12    115.3250   111.2100   15.8000   297.6000

buffer_size:
Level             N    Mean       Std        Min       Max
------------------------------------------------------------
1048576          12    153.3833   112.8749   15.8000   316.2000
268435456        12     94.0917    85.3295   29.7000   297.6000

=== Main Effects: latency_ns ===
Factor           Effect      Std Error   % Contribution
--------------------------------------------------------------
numa_hop          58.3750    10.4980     78.9%
buffer_size        8.4917    10.4980     11.5%
thread_count      -3.8250    10.4980      5.2%
transfer_mode     -3.2750    10.4980      4.4%

=== Interaction Effects: latency_ns ===
Factor A         Factor B         Interaction   % Contribution
------------------------------------------------------------------------
transfer_mode    thread_count     -12.8250      40.2%
thread_count     buffer_size      -10.3917      32.6%
transfer_mode    buffer_size        8.6583      27.2%

=== Summary Statistics: latency_ns ===

numa_hop:
Level             N    Mean       Std       Min       Max
------------------------------------------------------------
far               8    158.7500   42.0005   83.0000   202.0000
local             8    141.0125   57.9953   68.6000   212.3000
near              8    100.3750   38.9371   61.5000   180.2000

transfer_mode:
Level             N    Mean       Std       Min       Max
------------------------------------------------------------
memcpy           12    135.0167   47.3819   65.8000   200.5000
streaming_store  12    131.7417   57.2674   61.5000   212.3000

thread_count:
Level             N    Mean       Std       Min       Max
------------------------------------------------------------
1                12    135.2917   53.3590   65.8000   212.3000
14               12    131.4667   51.7232   61.5000   205.1000

buffer_size:
Level             N    Mean       Std       Min       Max
------------------------------------------------------------
1048576          12    129.1333   52.7718   61.5000   202.0000
268435456        12    137.6250   52.0217   65.8000   212.3000
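The main effects and contribution percentages for bandwidth_GBs can be reproduced from the level means in the summary statistics. The sketch below is an inference from the reported numbers, not the tool's confirmed implementation: it assumes a two-level effect is the second level's mean minus the first's, that a three-level effect is the spread between the highest and lowest level means, and that each contribution is the absolute effect divided by the sum of absolute effects.

```python
# Level means for bandwidth_GBs, copied from the summary statistics above.
level_means = {
    "numa_hop": {"far": 162.7625, "local": 112.4250, "near": 96.0250},
    "transfer_mode": {"memcpy": 118.1250, "streaming_store": 129.3500},
    "thread_count": {1: 132.1500, 14: 115.3250},
    "buffer_size": {1048576: 153.3833, 268435456: 94.0917},
}


def main_effect(means):
    """ASSUMED definition: second level minus first for two-level factors,
    max minus min of the level means for factors with more levels."""
    values = list(means.values())
    if len(values) == 2:
        return values[1] - values[0]
    return max(values) - min(values)


effects = {name: main_effect(means) for name, means in level_means.items()}

# ASSUMED definition: share of each absolute effect in the total.
total = sum(abs(e) for e in effects.values())
contribution = {name: 100 * abs(e) / total for name, e in effects.items()}

# Print factors from largest to smallest absolute effect.
for name in sorted(effects, key=lambda n: -abs(effects[n])):
    print(f"{name:14s} {effects[name]:10.4f} {contribution[name]:5.1f}%")
```

Under these assumptions the computed effects agree with the Main Effects table to within rounding of the published means.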

Optimization Recommendations

doe optimize
=== Optimization: bandwidth_GBs ===
Direction: maximize
Best observed run: #14
  numa_hop = local
  transfer_mode = memcpy
  thread_count = 14
  buffer_size = 268435456
  Value: 316.2

RSM Model (linear, R² = 0.0488, Adj R² = -0.1514):
Coefficients:
  intercept: +123.7375
  numa_hop: -13.7938
  transfer_mode: +9.9542
  thread_count: -7.8542
  buffer_size: +14.2625

Predicted optimum:
  numa_hop = far
  transfer_mode = streaming_store
  thread_count = 1
  buffer_size = 268435456
  Predicted value: 169.6021

Factor importance:
  1. buffer_size (effect: 28.5, contribution: 31.1%)
  2. numa_hop (effect: 27.6, contribution: 30.1%)
  3. transfer_mode (effect: 19.9, contribution: 21.7%)
  4. thread_count (effect: -15.7, contribution: 17.1%)

=== Optimization: latency_ns ===
Direction: minimize
Best observed run: #18
  numa_hop = near
  transfer_mode = streaming_store
  thread_count = 1
  buffer_size = 1048576
  Value: 61.5

RSM Model (linear, R² = 0.2038, Adj R² = 0.0362):
Coefficients:
  intercept: +133.3792
  numa_hop: -25.4563
  transfer_mode: +1.2792
  thread_count: -0.8292
  buffer_size: -9.0708

Predicted optimum:
  numa_hop = far
  transfer_mode = streaming_store
  thread_count = 1
  buffer_size = 1048576
  Predicted value: 170.0146

Factor importance:
  1. numa_hop (effect: 50.9, contribution: 69.5%)
  2. buffer_size (effect: -18.1, contribution: 24.8%)
  3. transfer_mode (effect: 2.6, contribution: 3.5%)
  4. thread_count (effect: -1.7, contribution: 2.3%)
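The predicted value for bandwidth_GBs can be checked by evaluating the linear model by hand. The sketch below assumes each factor is coded to -1 or +1; the specific mapping (far = -1, streaming_store = +1, 1 thread = -1, 268435456 bytes = +1) is a guess chosen because it reproduces the reported number, and the tool's actual coding is not documented here.

```python
# Linear model coefficients for bandwidth_GBs, copied from the output above.
coefficients = {
    "intercept": 123.7375,
    "numa_hop": -13.7938,
    "transfer_mode": 9.9542,
    "thread_count": -7.8542,
    "buffer_size": 14.2625,
}

# ASSUMPTION: factors are coded to -1/+1. This mapping is hypothetical and
# was picked because it matches the reported predicted optimum.
coded_optimum = {
    "numa_hop": -1,       # far
    "transfer_mode": +1,  # streaming_store
    "thread_count": -1,   # 1 thread
    "buffer_size": +1,    # 268435456 bytes
}


def predict(coefs, coded):
    """Evaluate intercept + sum(coef_i * x_i) for coded factor settings."""
    return coefs["intercept"] + sum(coefs[name] * x for name, x in coded.items())


predicted = predict(coefficients, coded_optimum)
print(f"{predicted:.4f}")
```

Under this coding the result agrees with the reported 169.6021 to within rounding of the printed coefficients.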

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.9204

Per-Response Desirability

| Response | Weight | Desirability | Predicted | Dir |
|---|---|---|---|---|
| bandwidth_GBs | 1.5 | 0.9545 | 316.20 GB/s | ↑ |
| latency_ns | 1.0 | 0.8714 | 75.30 ns | ↓ |
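The overall desirability follows from the per-response values and weights via the weighted geometric mean. As a minimal sketch (the function name is illustrative, not part of the doe tool):

```python
import math


def overall_desirability(desirabilities, weights):
    """Weighted geometric mean: D = (prod d_i ** w_i) ** (1 / sum w_i)."""
    if any(d <= 0 for d in desirabilities.values()):
        return 0.0  # one fully undesirable response zeroes the composite
    total_weight = sum(weights.values())
    log_sum = sum(weights[name] * math.log(d) for name, d in desirabilities.items())
    return math.exp(log_sum / total_weight)


# Weights and per-response desirabilities from the table above.
weights = {"bandwidth_GBs": 1.5, "latency_ns": 1.0}
desirabilities = {"bandwidth_GBs": 0.9545, "latency_ns": 0.8714}

D = overall_desirability(desirabilities, weights)
print(f"D = {D:.4f}")
```

With the rounded inputs shown in the table, the result agrees with the reported D = 0.9204 to about three decimal places. The geometric mean is used rather than an arithmetic one so that a response with zero desirability cannot be compensated by the others.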

Recommended Settings

| Factor | Value |
|---|---|
| numa_hop | local |
| transfer_mode | memcpy |
| thread_count | 14 threads |
| buffer_size | 268435456 bytes |

Source: from observed run #14

Trade-off Summary

Sacrifice = how much worse than single-objective best.

| Response | Predicted | Best Observed | Sacrifice |
|---|---|---|---|
| bandwidth_GBs | 316.20 | 316.20 | +0.00 |
| latency_ns | 75.30 | 61.50 | +13.80 |

Top 3 Runs by Desirability

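| Run | D | Factor Settings |
|---|---|---|
| #14 | 0.9204 | numa_hop=local, transfer_mode=memcpy, thread_count=14, buffer_size=268435456 |
| #20 | 0.8682 | numa_hop=far, transfer_mode=memcpy, thread_count=14, buffer_size=1048576 |
| #16 | 0.8108 | numa_hop=near, transfer_mode=memcpy, thread_count=14, buffer_size=1048576 |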

Model Quality

| Response | R² | Type |
|---|---|---|
| bandwidth_GBs | 0.3167 | linear |
| latency_ns | 0.2229 | linear |

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9204

Response         Weight   Desirability   Predicted      Direction
---------------------------------------------------------------------
bandwidth_GBs    1.5      0.9545         316.20 GB/s    ↑
latency_ns       1.0      0.8714         75.30 ns       ↓

Recommended settings:
  numa_hop = local
  transfer_mode = memcpy
  thread_count = 14 threads
  buffer_size = 268435456 bytes
  (from observed run #14)

Trade-off summary:
  bandwidth_GBs: 316.20 (best observed: 316.20, sacrifice: +0.00)
  latency_ns: 75.30 (best observed: 61.50, sacrifice: +13.80)

Model quality:
  bandwidth_GBs: R² = 0.3167 (linear)
  latency_ns: R² = 0.2229 (linear)

Top 3 observed runs by overall desirability:
  1. Run #14 (D=0.9204): numa_hop=local, transfer_mode=memcpy, thread_count=14, buffer_size=268435456
  2. Run #20 (D=0.8682): numa_hop=far, transfer_mode=memcpy, thread_count=14, buffer_size=1048576
  3. Run #16 (D=0.8108): numa_hop=near, transfer_mode=memcpy, thread_count=14, buffer_size=1048576