CPU Cross-NUMA Bandwidth

Summary

This experiment investigates cpu cross-numa bandwidth. Full factorial design to characterize CPU-to-CPU memory bandwidth and latency across NUMA domains on a 4-socket HPC node.

The design varies 4 factors: numa hop, ranging from local to far, transfer mode, ranging from memcpy to streaming_store, thread count (threads), ranging from 1 to 14, and buffer size (bytes), ranging from 1048576 to 268435456. The goal is to optimize 2 responses: bandwidth GBs (GB/s) (maximize) and latency ns (ns) (minimize). Fixed conditions held constant across all runs include sockets = 4, cores per socket = 28, numa distance near = 21, numa distance far = 31.

A full factorial design was used to explore all 16 possible combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (24 runs).

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.

Key Findings

For bandwidth GBs, the most influential factors were numa hop (54.3%), thread count (33.1%), transfer mode (7.0%). The best observed value was 316.2 (at numa hop = local, transfer mode = memcpy, thread count = 14).

For latency ns, the most influential factors were transfer mode (53.3%), thread count (28.1%), numa hop (17.3%). The best observed value was 61.5 (at numa hop = local, transfer mode = streaming_store, thread count = 1).

Recommended Next Steps

Consider whether any fixed factors should be varied in a future study.

Experimental Setup

Factors

Factor	Levels	Type	Unit
`numa_hop`	local, near, far	categorical
`transfer_mode`	memcpy, streaming_store	categorical
`thread_count`	1, 14	continuous	threads
`buffer_size`	1048576, 268435456	continuous	bytes

Fixed: sockets = 4, cores_per_socket = 28, numa_distance_near = 21, numa_distance_far = 31

Responses

Response	Direction	Unit
`bandwidth_GBs`	↑ maximize	GB/s
`latency_ns`	↓ minimize	ns

Experimental Matrix

The Full Factorial Design produces 24 runs. Each row is one experiment with specific factor settings.

Run	`numa_hop`	`transfer_mode`	`thread_count`	`buffer_size`
1	near	streaming_store	14	268435456
2	local	streaming_store	1	268435456
3	far	memcpy	14	1048576
4	far	streaming_store	14	268435456
5	near	streaming_store	1	268435456
6	near	memcpy	14	1048576
7	far	memcpy	1	268435456
8	near	memcpy	14	268435456
9	near	streaming_store	14	1048576
10	far	memcpy	1	1048576
11	local	memcpy	1	268435456
12	near	streaming_store	1	1048576
13	far	streaming_store	1	268435456
14	local	streaming_store	14	1048576
15	near	memcpy	1	268435456
16	local	memcpy	14	1048576
17	far	streaming_store	14	1048576
18	local	streaming_store	1	1048576
19	far	memcpy	14	268435456
20	local	streaming_store	14	268435456
21	near	memcpy	1	1048576
22	local	memcpy	1	1048576
23	local	memcpy	14	268435456
24	far	streaming_store	1	1048576

How to Run

terminal
$ doe info --config use_cases/17_cpu_cross_numa_bandwidth/config.json
$ doe generate --config use_cases/17_cpu_cross_numa_bandwidth/config.json --output results/run.sh --seed 42
$ bash results/run.sh
$ doe analyze --config use_cases/17_cpu_cross_numa_bandwidth/config.json
$ doe optimize --config use_cases/17_cpu_cross_numa_bandwidth/config.json
$ doe optimize --config use_cases/17_cpu_cross_numa_bandwidth/config.json --multi  # multi-objective
$ doe report --config use_cases/17_cpu_cross_numa_bandwidth/config.json --output report.html

Analysis Results

Generated from actual experiment runs.

Response: bandwidth_GBs

Pareto Chart

Main Effects Plot

Response: latency_ns

Pareto Chart

Main Effects Plot

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

bandwidth_GBs: thread_count vs buffer_size

latency_ns: thread_count vs buffer_size

Full Analysis Output

doe analyze
=== Main Effects: bandwidth_GBs ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
numa_hop                66.7375      20.9093            43.3%
buffer_size            -59.2917      20.9093            38.5%
thread_count           -16.8250      20.9093            10.9%
transfer_mode           11.2250      20.9093             7.3%

=== Interaction Effects: bandwidth_GBs ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
transfer_mode        buffer_size               53.1583            67.9%
transfer_mode        thread_count              13.2583            16.9%
thread_count         buffer_size               11.9083            15.2%

=== Summary Statistics: bandwidth_GBs ===

numa_hop:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  far                 8   162.7625    94.4747    19.8000   297.6000
  local               8   112.4250   110.3910    15.8000   316.2000
  near                8    96.0250   102.9568    30.8000   271.9000

transfer_mode:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  memcpy             12   118.1250   111.6678    15.8000   316.2000
  streaming_store    12   129.3500    96.9587    19.8000   297.6000

thread_count:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1                  12   132.1500    97.0430    29.9000   316.2000
  14                 12   115.3250   111.2100    15.8000   297.6000

buffer_size:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1048576            12   153.3833   112.8749    15.8000   316.2000
  268435456          12    94.0917    85.3295    29.7000   297.6000

=== Main Effects: latency_ns ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
numa_hop                58.3750      10.4980            78.9%
buffer_size              8.4917      10.4980            11.5%
thread_count            -3.8250      10.4980             5.2%
transfer_mode           -3.2750      10.4980             4.4%

=== Interaction Effects: latency_ns ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
transfer_mode        thread_count             -12.8250            40.2%
thread_count         buffer_size              -10.3917            32.6%
transfer_mode        buffer_size                8.6583            27.2%

=== Summary Statistics: latency_ns ===

numa_hop:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  far                 8   158.7500    42.0005    83.0000   202.0000
  local               8   141.0125    57.9953    68.6000   212.3000
  near                8   100.3750    38.9371    61.5000   180.2000

transfer_mode:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  memcpy             12   135.0167    47.3819    65.8000   200.5000
  streaming_store    12   131.7417    57.2674    61.5000   212.3000

thread_count:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1                  12   135.2917    53.3590    65.8000   212.3000
  14                 12   131.4667    51.7232    61.5000   205.1000

buffer_size:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1048576            12   129.1333    52.7718    61.5000   202.0000
  268435456          12   137.6250    52.0217    65.8000   212.3000

Optimization Recommendations

doe optimize
=== Optimization: bandwidth_GBs ===
Direction: maximize

Best observed run: #14
  numa_hop = local
  transfer_mode = memcpy
  thread_count = 14
  buffer_size = 268435456
  Value: 316.2

RSM Model (linear, R² = 0.0488, Adj R² = -0.1514):
  Coefficients:
    intercept:  +123.7375
    numa_hop:  -13.7938
    transfer_mode:  +9.9542
    thread_count:  -7.8542
    buffer_size:  +14.2625
  Predicted optimum:
    numa_hop = far
    transfer_mode = streaming_store
    thread_count = 1
    buffer_size = 268435456
    Predicted value: 169.6021

Factor importance:
  1. buffer_size  (effect: 28.5, contribution: 31.1%)
  2. numa_hop  (effect: 27.6, contribution: 30.1%)
  3. transfer_mode  (effect: 19.9, contribution: 21.7%)
  4. thread_count  (effect: -15.7, contribution: 17.1%)

=== Optimization: latency_ns ===
Direction: minimize

Best observed run: #18
  numa_hop = near
  transfer_mode = streaming_store
  thread_count = 1
  buffer_size = 1048576
  Value: 61.5

RSM Model (linear, R² = 0.2038, Adj R² = 0.0362):
  Coefficients:
    intercept:  +133.3792
    numa_hop:  -25.4563
    transfer_mode:  +1.2792
    thread_count:  -0.8292
    buffer_size:  -9.0708
  Predicted optimum:
    numa_hop = far
    transfer_mode = streaming_store
    thread_count = 1
    buffer_size = 1048576
    Predicted value: 170.0146

Factor importance:
  1. numa_hop  (effect: 50.9, contribution: 69.5%)
  2. buffer_size  (effect: -18.1, contribution: 24.8%)
  3. transfer_mode  (effect: 2.6, contribution: 3.5%)
  4. thread_count  (effect: -1.7, contribution: 2.3%)

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability

D = 0.9204

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`bandwidth_GBs`	1.5	0.9545	316.20 0.9545 316.20 GB/s	↑
`latency_ns`	1.0	0.8714	75.30 0.8714 75.30 ns	↓

Recommended Settings

Factor	Value
`numa_hop`	local
`transfer_mode`	memcpy
`thread_count`	14 threads
`buffer_size`	268435456 bytes

Source: from observed run #14

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response	Predicted	Best Observed	Sacrifice
`latency_ns`	75.30	61.50	+13.80

Top 3 Runs by Desirability

Run	D	Factor Settings
#20	0.8682	numa_hop=far, transfer_mode=memcpy, thread_count=14, buffer_size=1048576
#16	0.8108	numa_hop=near, transfer_mode=memcpy, thread_count=14, buffer_size=1048576

Model Quality

Response	R²	Type
`latency_ns`	0.2229	linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9204

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
bandwidth_GBs                1.5       0.9545      316.20 GB/s   ↑
latency_ns                   1.0       0.8714       75.30 ns   ↓

Recommended settings:
  numa_hop = local
  transfer_mode = memcpy
  thread_count = 14 threads
  buffer_size = 268435456 bytes
  (from observed run #14)

Trade-off summary:
  bandwidth_GBs: 316.20 (best observed: 316.20, sacrifice: +0.00)
  latency_ns: 75.30 (best observed: 61.50, sacrifice: +13.80)

Model quality:
  bandwidth_GBs: R² = 0.3167 (linear)
  latency_ns: R² = 0.2229 (linear)

Top 3 observed runs by overall desirability:
  1. Run #14 (D=0.9204): numa_hop=local, transfer_mode=memcpy, thread_count=14, buffer_size=268435456
  2. Run #20 (D=0.8682): numa_hop=far, transfer_mode=memcpy, thread_count=14, buffer_size=1048576
  3. Run #16 (D=0.8108): numa_hop=near, transfer_mode=memcpy, thread_count=14, buffer_size=1048576