Test Suite Sharding — DOE Use Case

Summary

This experiment investigates test suite sharding. It uses a Central Composite Design to optimize shard count, retry count, and timeout multiplier for total wall time and flaky failure rate.

The design varies 3 factors: shard count (2 to 16 shards), retry flaky count (0 to 3 retries), and timeout multiplier (1.0x to 3.0x). The goal is to minimize 2 responses: total wall time (min) and flaky failure rate (%). Two conditions are held constant across all runs: framework = pytest and test count = 4500.

A Central Composite Design (CCD) was selected to fit a full quadratic response surface model, including curvature and interaction effects. With 3 factors this produces 22 runs: 8 factorial corner points, 6 axial (star) points that extend beyond the factorial range, and 8 center points.

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The response surface plots below visualize how pairs of factors jointly affect each response.
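For reference, the standard full quadratic (second-order) model for three factors has the form below. Here x_1, x_2, x_3 stand for shard count, retry flaky count, and timeout multiplier, assumed to be in coded units (-1 at the low level, +1 at the high level); the β symbols are generic coefficient names, not notation taken from the tool.

```latex
\hat{y} = \beta_0
        + \sum_{i=1}^{3} \beta_i x_i
        + \sum_{i=1}^{3} \beta_{ii} x_i^2
        + \sum_{1 \le i < j \le 3} \beta_{ij} x_i x_j
```

This model has 10 coefficients (1 intercept, 3 linear, 3 quadratic, 3 interaction), so the 22 runs leave 12 residual degrees of freedom.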

Key Findings

For total wall time, the most influential factors were timeout multiplier (35.3%), shard count (33.3%), and retry flaky count (31.4%), although none is statistically significant in the ANOVA (all p > 0.7). The best observed value was 22.1 min (at shard count = -3.78019, retry flaky count = 1.5, timeout multiplier = 2).

For flaky failure rate, the most influential factors were retry flaky count (63.5%), shard count (20.4%), and timeout multiplier (16.1%); only retry flaky count is statistically significant in the ANOVA (p = 0.0127). The best observed value was 2.01% (at shard count = 16, retry flaky count = 3, timeout multiplier = 1).

Recommended Next Steps

Run confirmation experiments at the predicted optimal settings to validate the model.
Consider whether any fixed factors should be varied in a future study.

Experimental Setup

Factors

Factor	Low	High	Unit
`shard_count`	2	16	shards
`retry_flaky_count`	0	3	retries
`timeout_multiplier`	1.0	3.0	x

Fixed: framework = pytest, test_count = 4500

Responses

Response	Direction	Unit
`total_wall_time_min`	↓ minimize	min
`flaky_failure_rate`	↓ minimize	%

Configuration

use_cases/81_test_suite_sharding/config.json

{
  "metadata": {
    "name": "Test Suite Sharding",
    "description": "Central Composite design to optimize shard count, retry count, and timeout multiplier for wall time and flaky failures"
  },
  "factors": [
    {
      "name": "shard_count",
      "levels": [
        "2",
        "16"
      ],
      "type": "continuous",
      "unit": "shards"
    },
    {
      "name": "retry_flaky_count",
      "levels": [
        "0",
        "3"
      ],
      "type": "continuous",
      "unit": "retries"
    },
    {
      "name": "timeout_multiplier",
      "levels": [
        "1.0",
        "3.0"
      ],
      "type": "continuous",
      "unit": "x"
    }
  ],
  "fixed_factors": {
    "framework": "pytest",
    "test_count": "4500"
  },
  "responses": [
    {
      "name": "total_wall_time_min",
      "optimize": "minimize",
      "unit": "min"
    },
    {
      "name": "flaky_failure_rate",
      "optimize": "minimize",
      "unit": "%"
    }
  ],
  "settings": {
    "operation": "central_composite",
    "test_script": "use_cases/81_test_suite_sharding/sim.sh"
  }
}
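The factor levels in this config are stored as strings, so any script that reads the file needs to convert them before doing arithmetic. The following is a minimal sketch, not part of the tool: the helper name `factor_ranges` is hypothetical, and `CONFIG_TEXT` is a trimmed copy of the factors section above.

```python
import json

# Trimmed copy of the "factors" section of config.json shown above.
CONFIG_TEXT = """
{
  "factors": [
    {"name": "shard_count", "levels": ["2", "16"], "type": "continuous"},
    {"name": "retry_flaky_count", "levels": ["0", "3"], "type": "continuous"},
    {"name": "timeout_multiplier", "levels": ["1.0", "3.0"], "type": "continuous"}
  ]
}
"""


def factor_ranges(config):
    """Return low, high, center, and half-range for each continuous factor."""
    ranges = {}
    for factor in config["factors"]:
        if factor["type"] != "continuous":
            continue  # this sketch only handles continuous factors
        low, high = (float(v) for v in factor["levels"])  # levels are strings
        ranges[factor["name"]] = {
            "low": low,
            "high": high,
            "center": (low + high) / 2,
            "half_range": (high - low) / 2,
        }
    return ranges


ranges = factor_ranges(json.loads(CONFIG_TEXT))
for name, r in ranges.items():
    print(f"{name}: center={r['center']}, half_range={r['half_range']}")
```

The centers (9, 1.5, 2) match the center-point rows of the experimental matrix.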

Experimental Matrix

The Central Composite Design produces 22 runs. Each row is one experiment with specific factor settings.

Run	`shard_count`	`retry_flaky_count`	`timeout_multiplier`
1	9	1.5	2
2	16	0	3
3	2	3	1
4	9	4.23861	2
5	9	1.5	2
6	-3.78019	1.5	2
7	9	1.5	0.174258
8	9	1.5	2
9	16	3	1
10	21.7802	1.5	2
11	9	1.5	2
12	9	-1.23861	2
13	9	1.5	2
14	2	0	3
15	9	1.5	2
16	16	0	1
17	9	1.5	3.82574
18	16	3	3
19	9	1.5	2
20	2	0	1
21	2	3	3
22	9	1.5	2
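The matrix above can be reconstructed from the factor ranges as a rough check. The sketch below assumes an axial distance of 1.82574 in coded units, inferred from the matrix itself as (21.7802 - 9) / 7, and 8 center points, counted from the rows. How the tool chooses these values is not stated here, and the run order produced by this sketch is not the randomized order shown in the table.

```python
from itertools import product

# Low/high levels from the Factors table.
FACTORS = {
    "shard_count": (2.0, 16.0),
    "retry_flaky_count": (0.0, 3.0),
    "timeout_multiplier": (1.0, 3.0),
}
ALPHA = 1.82574  # assumption: inferred from the matrix, (21.7802 - 9) / 7
N_CENTER = 8     # number of (9, 1.5, 2) rows in the matrix


def decode(coded):
    """Map coded coordinates to natural units: center + coded * half_range."""
    natural = []
    for c, (low, high) in zip(coded, FACTORS.values()):
        center = (low + high) / 2
        half_range = (high - low) / 2
        natural.append(center + c * half_range)
    return tuple(natural)


def ccd_points():
    """Return factorial, axial, and center points in natural units."""
    k = len(FACTORS)
    # 2^k factorial corners at coded -1 / +1.
    coded = [tuple(float(v) for v in p) for p in product((-1, 1), repeat=k)]
    # 2k axial points: one factor at -ALPHA or +ALPHA, the others at 0.
    for i in range(k):
        for sign in (-1, 1):
            point = [0.0] * k
            point[i] = sign * ALPHA
            coded.append(tuple(point))
    # Replicated center points.
    coded += [(0.0,) * k] * N_CENTER
    return [decode(p) for p in coded]


points = ccd_points()
print(len(points), "runs")
for p in points[8:14]:  # the six axial points
    print(tuple(round(v, 5) for v in p))
```

The axial points land on the same out-of-range values as the matrix, including the negative shard count and retry count.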

Step-by-Step Workflow

1. Preview the design

$ doe info --config use_cases/81_test_suite_sharding/config.json

2. Generate the runner script

$ doe generate --config use_cases/81_test_suite_sharding/config.json \
    --output use_cases/81_test_suite_sharding/results/run.sh --seed 42

3. Execute the experiments

$ bash use_cases/81_test_suite_sharding/results/run.sh

4. Analyze results

$ doe analyze --config use_cases/81_test_suite_sharding/config.json

5. Get optimization recommendations

$ doe optimize --config use_cases/81_test_suite_sharding/config.json

6. Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

$ doe optimize --config use_cases/81_test_suite_sharding/config.json --multi

7. Generate the HTML report

$ doe report --config use_cases/81_test_suite_sharding/config.json \
    --output use_cases/81_test_suite_sharding/results/report.html
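The seven steps can also be chained from a small driver script. This is a sketch only: the function names are hypothetical, the commands and flags are copied from the steps above, and it defaults to a dry run that prints each command without executing it.

```python
import shlex
import subprocess

CONFIG = "use_cases/81_test_suite_sharding/config.json"
RESULTS = "use_cases/81_test_suite_sharding/results"


def build_commands(config=CONFIG, results=RESULTS):
    """Return the workflow commands in the order given in the steps above."""
    return [
        ["doe", "info", "--config", config],
        ["doe", "generate", "--config", config,
         "--output", f"{results}/run.sh", "--seed", "42"],
        ["bash", f"{results}/run.sh"],
        ["doe", "analyze", "--config", config],
        ["doe", "optimize", "--config", config],
        ["doe", "optimize", "--config", config, "--multi"],
        ["doe", "report", "--config", config,
         "--output", f"{results}/report.html"],
    ]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    run_all(dry_run=True)
```

Using `check=True` means a failed step raises an exception, so later steps do not run against missing results.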

Features Exercised

Feature	Value
Design type	`central_composite`
Factor types	`continuous` (all 3)
Arg style	`double-dash`
Responses	2 (total_wall_time_min ↓, flaky_failure_rate ↓)
Total runs	22

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: total_wall_time_min

Top factors: timeout_multiplier (35.3%), shard_count (33.3%), retry_flaky_count (31.4%).
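The reported effects and percentages are consistent with a simple calculation: take each factor's effect as the range (max minus min) of its level means, then divide by the sum of the three effects. This is inferred from the numbers, not a documented formula. The sketch below uses the level means listed in the summary statistics for this response.

```python
# Level means for total_wall_time_min, copied from the summary statistics.
level_means = {
    "timeout_multiplier": [31.4000, 30.9750, 32.9583, 38.4000, 48.9000],
    "shard_count": [25.5000, 39.0000, 30.3750, 42.4000, 33.9917],
    "retry_flaky_count": [23.8000, 39.7250, 35.3750, 29.6500, 27.5000],
}

# Assumed definition: effect = largest level mean minus smallest level mean.
effects = {name: max(means) - min(means) for name, means in level_means.items()}
total = sum(effects.values())

# Percent contribution = share of each effect in the sum of all effects.
contribution = {name: 100 * effect / total for name, effect in effects.items()}

for name in sorted(effects, key=effects.get, reverse=True):
    print(f"{name:<20} effect={effects[name]:.4f}  {contribution[name]:.1f}%")
```

Note that the single-run axial levels drive the range for every factor, so these percentages are sensitive to individual observations.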

ANOVA

Source	DF	SS	MS	F	p-value
shard_count	4	294.0965	73.5241	0.382	0.8160
retry_flaky_count	4	374.4932	93.6233	0.487	0.7457
timeout_multiplier	4	354.5565	88.6391	0.461	0.7630
Lack of Fit	2	278.1682	139.0841	0.723	0.5181
Pure Error	7	1346.0187	192.2884
Error	9	1624.1870	192.2884
Total	21	2647.3332	126.0635
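The F values for this response are consistent with dividing each mean square by the pure-error mean square (1346.0187 / 7 ≈ 192.2884). The sketch below checks that arithmetic; it is an observation about the numbers rather than a statement of how the tool is implemented.

```python
# Values copied from the ANOVA for total_wall_time_min.
PURE_ERROR_SS = 1346.0187
PURE_ERROR_DF = 7

mean_squares = {
    "shard_count": 73.5241,
    "retry_flaky_count": 93.6233,
    "timeout_multiplier": 88.6391,
    "Lack of Fit": 139.0841,
}

# Mean square = sum of squares / degrees of freedom.
ms_pure_error = PURE_ERROR_SS / PURE_ERROR_DF

# F ratio = mean square of the source / mean square of the error term.
f_ratios = {source: ms / ms_pure_error for source, ms in mean_squares.items()}

print(f"MS pure error = {ms_pure_error:.4f}")
for source, f in f_ratios.items():
    print(f"{source:<20} F = {f:.3f}")
```

All four ratios are below 1, meaning each source varies less than the replicate-to-replicate noise at the center point.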

[Figures for total_wall_time_min: Pareto chart, main effects plot, normal probability plot of effects, half-normal plot of effects, model diagnostics]

Response: flaky_failure_rate

Top factors: retry_flaky_count (63.5%), shard_count (20.4%), timeout_multiplier (16.1%).

ANOVA

Source	DF	SS	MS	F	p-value
shard_count	4	13.5928	3.3982	0.913	0.4963
retry_flaky_count	4	88.4883	22.1221	5.946	0.0127
timeout_multiplier	4	22.5503	5.6376	1.515	0.2771
Lack of Fit	2	23.0813	11.5407	3.102	0.1085
Pure Error	7	26.0422	3.7203
Error	9	49.1235	3.7203
Total	21	173.7550	8.2740

[Figures for flaky_failure_rate: Pareto chart, main effects plot, normal probability plot of effects, half-normal plot of effects, model diagnostics]

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

[Figures: one surface plot per response (flaky_failure_rate, total_wall_time_min) and factor pair: retry_flaky_count vs timeout_multiplier, shard_count vs retry_flaky_count, shard_count vs timeout_multiplier]

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability

D = 0.9410

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`total_wall_time_min`	1.0	0.9209	23.80 min	↓
`flaky_failure_rate`	1.5	0.9545	2.01 %	↓
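A weighted geometric mean of the two per-response desirabilities above reproduces the reported overall value to within rounding. The sketch below shows that calculation together with a generic smaller-is-better desirability function. The `low` and `high` bounds and the `shape` exponent are hypothetical parameters; the bounds the tool actually uses are not stated here.

```python
import math


def desirability_minimize(y, low, high, shape=1.0):
    """Smaller-is-better desirability: 1 at or below low, 0 at or above high."""
    if y <= low:
        return 1.0
    if y >= high:
        return 0.0
    return ((high - y) / (high - low)) ** shape


def overall_desirability(values, weights):
    """Weighted geometric mean of individual desirabilities."""
    if any(d == 0.0 for d in values):
        return 0.0  # one unacceptable response makes the whole setting unacceptable
    log_sum = sum(w * math.log(d) for d, w in zip(values, weights))
    return math.exp(log_sum / sum(weights))


# Per-response desirabilities and weights from the table above.
D = overall_desirability([0.9209, 0.9545], [1.0, 1.5])
print(f"D = {D:.4f}")
```

Because the mean is geometric, a response with desirability 0 forces the overall value to 0 regardless of the weights.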

Recommended Settings

Factor	Value
`shard_count`	9 shards
`retry_flaky_count`	1.5 retries
`timeout_multiplier`	2 x

Source: from observed run #9

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response	Predicted	Best Observed	Sacrifice
`total_wall_time_min`	23.80	22.10	+1.70
`flaky_failure_rate`	2.01	2.01	+0.00

Top 3 Runs by Desirability

Run	D	Factor Settings
#9	0.9410	shard_count=9, retry_flaky_count=1.5, timeout_multiplier=2
#7	0.8485	shard_count=16, retry_flaky_count=3, timeout_multiplier=1
#18	0.8462	shard_count=16, retry_flaky_count=0, timeout_multiplier=3

Model Quality

Response	R²	Type
`total_wall_time_min`	0.1215	linear
`flaky_failure_rate`	0.0712	linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9410

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
total_wall_time_min          1.0       0.9209       23.80 min   ↓
flaky_failure_rate           1.5       0.9545        2.01 %   ↓

Recommended settings:
  shard_count = 9 shards
  retry_flaky_count = 1.5 retries
  timeout_multiplier = 2 x
  (from observed run #9)

Trade-off summary:
  total_wall_time_min: 23.80 (best observed: 22.10, sacrifice: +1.70)
  flaky_failure_rate: 2.01 (best observed: 2.01, sacrifice: +0.00)

Model quality:
  total_wall_time_min: R² = 0.1215 (linear)
  flaky_failure_rate: R² = 0.0712 (linear)

Top 3 observed runs by overall desirability:
  1. Run #9 (D=0.9410): shard_count=9, retry_flaky_count=1.5, timeout_multiplier=2
  2. Run #7 (D=0.8485): shard_count=16, retry_flaky_count=3, timeout_multiplier=1
  3. Run #18 (D=0.8462): shard_count=16, retry_flaky_count=0, timeout_multiplier=3

Full Analysis Output

doe analyze
=== Main Effects: total_wall_time_min ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
timeout_multiplier      17.9250       2.3938            35.3%
shard_count             16.9000       2.3938            33.3%
retry_flaky_count       15.9250       2.3938            31.4%

=== ANOVA Table: total_wall_time_min ===
Source                      DF           SS           MS          F    p-value
-----------------------------------------------------------------------------
shard_count                  4     294.0965      73.5241      0.382     0.8160
retry_flaky_count            4     374.4932      93.6233      0.487     0.7457
timeout_multiplier           4     354.5565      88.6391      0.461     0.7630
Lack of Fit                  2     278.1682     139.0841      0.723     0.5181
Pure Error                   7    1346.0187     192.2884
Error                        9    1624.1870     192.2884
Total                       21    2647.3332     126.0635

=== Summary Statistics: total_wall_time_min ===

shard_count:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  -3.78019            1    25.5000     0.0000    25.5000    25.5000
  16                  4    39.0000    12.7161    27.6000    54.7000
  2                   4    30.3750     6.9462    22.1000    39.1000
  21.7802             1    42.4000     0.0000    42.4000    42.4000
  9                  12    33.9917    12.5169    23.8000    68.1000

retry_flaky_count:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  -1.23861            1    23.8000     0.0000    23.8000    23.8000
  0                   4    39.7250    11.9036    30.0000    54.7000
  1.5                12    35.3750    12.4215    25.1000    68.1000
  3                   4    29.6500     7.0835    22.1000    39.1000
  4.23861             1    27.5000     0.0000    27.5000    27.5000

timeout_multiplier:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  0.174258            1    31.4000     0.0000    31.4000    31.4000
  1                   4    30.9750     9.2676    22.1000    43.9000
  2                  12    32.9583    12.1521    23.8000    68.1000
  3                   4    38.4000    11.7004    29.8000    54.7000
  3.82574             1    48.9000     0.0000    48.9000    48.9000

=== Main Effects: flaky_failure_rate ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
retry_flaky_count       11.9400       0.6133            63.5%
shard_count              3.8425       0.6133            20.4%
timeout_multiplier       3.0225       0.6133            16.1%

=== ANOVA Table: flaky_failure_rate ===
Source                      DF           SS           MS          F    p-value
-----------------------------------------------------------------------------
shard_count                  4      13.5928       3.3982      0.913     0.4963
retry_flaky_count            4      88.4883      22.1221      5.946     0.0127
timeout_multiplier           4      22.5503       5.6376      1.515     0.2771
Lack of Fit                  2      23.0813      11.5407      3.102     0.1085
Pure Error                   7      26.0422       3.7203
Error                        9      49.1235       3.7203
Total                       21     173.7550       8.2740

=== Summary Statistics: flaky_failure_rate ===

shard_count:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  -3.78019            1     4.8700     0.0000     4.8700     4.8700
  16                  4     6.0550     3.5811     3.4300    11.3300
  2                   4     4.8175     1.8627     2.5000     7.0600
  21.7802             1     8.6600     0.0000     8.6600     8.6600
  9                  12     6.0117     3.1806     2.0100    13.9500

retry_flaky_count:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  -1.23861            1     2.0100     0.0000     2.0100     2.0100
  0                   4     6.3550     3.3245     4.3800    11.3300
  1.5                12     5.8092     1.9087     3.5500     9.9400
  3                   4     4.5175     2.0028     2.5000     7.0600
  4.23861             1    13.9500     0.0000    13.9500    13.9500

timeout_multiplier:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  0.174258            1     5.2400     0.0000     5.2400     5.2400
  1                   4     6.6825     3.4381     3.4300    11.3300
  2                  12     6.3975     3.1864     2.0100    13.9500
  3                   4     4.1900     1.1628     2.5000     5.0800
  3.82574             1     3.6600     0.0000     3.6600     3.6600

Optimization Recommendations

doe optimize
=== Optimization: total_wall_time_min ===
Direction: minimize

Best observed run: #16
  shard_count = -3.78019
  retry_flaky_count = 1.5
  timeout_multiplier = 2
  Value: 22.1

RSM Model (linear, R² = 0.2364, Adj R² = 0.1091):
  Coefficients:
    intercept                      +34.2409
    shard_count                    -0.1528
    retry_flaky_count              +0.0182
    timeout_multiplier             -6.5303

RSM Model (quadratic, R² = 0.7003, Adj R² = 0.4755):
  Coefficients:
    intercept                      +34.1449
    shard_count                    -0.1528
    retry_flaky_count              +0.0182
    timeout_multiplier             -6.5303
    shard_count*retry_flaky_count  -8.4375
    shard_count*timeout_multiplier +6.0375
    retry_flaky_count*timeout_multiplier -3.5125
    shard_count^2                  -2.3270
    retry_flaky_count^2            -0.1070
    timeout_multiplier^2           +2.5780

  Curvature analysis:
    timeout_multiplier             coef=+2.5780  convex (has a minimum)
    shard_count                    coef=-2.3270  concave (has a maximum)
    retry_flaky_count              coef=-0.1070  concave (has a maximum)

  Notable interactions:
    shard_count*retry_flaky_count  coef=-8.4375  (antagonistic)
    shard_count*timeout_multiplier coef=+6.0375  (synergistic)
    retry_flaky_count*timeout_multiplier coef=-3.5125  (antagonistic)

  Predicted optimum (from quadratic model, at observed points):
    shard_count = 2
    retry_flaky_count = 3
    timeout_multiplier = 1
    Predicted value: 58.9777

  Surface optimum (via L-BFGS-B, quadratic model):
    shard_count = 2
    retry_flaky_count = 0
    timeout_multiplier = 3
    Predicted value: 16.9308

  Model quality: Good fit — general trends are captured, some noise remains.

Factor importance:
  1. timeout_multiplier  (effect: 27.1, contribution: 55.6%)
  2. shard_count  (effect: 15.0, contribution: 30.8%)
  3. retry_flaky_count  (effect: 6.6, contribution: 13.6%)

=== Optimization: flaky_failure_rate ===
Direction: minimize

Best observed run: #9
  shard_count = 16
  retry_flaky_count = 3
  timeout_multiplier = 1
  Value: 2.01

RSM Model (linear, R² = 0.0325, Adj R² = -0.1288):
  Coefficients:
    intercept                      +5.8709
    shard_count                    +0.4643
    retry_flaky_count              -0.2689
    timeout_multiplier             -0.3112

RSM Model (quadratic, R² = 0.6937, Adj R² = 0.4639):
  Coefficients:
    intercept                      +5.1553
    shard_count                    +0.4643
    retry_flaky_count              -0.2689
    timeout_multiplier             -0.3112
    shard_count*retry_flaky_count  -1.6913
    shard_count*timeout_multiplier -1.1313
    retry_flaky_count*timeout_multiplier +1.9063
    shard_count^2                  +1.4888
    retry_flaky_count^2            +0.0758
    timeout_multiplier^2           -0.4912

  Curvature analysis:
    shard_count                    coef=+1.4888  convex (has a minimum)
    timeout_multiplier             coef=-0.4912  concave (has a maximum)
    retry_flaky_count              coef=+0.0758  negligible curvature

  Notable interactions:
    retry_flaky_count*timeout_multiplier coef=+1.9063  (synergistic)
    shard_count*retry_flaky_count  coef=-1.6913  (antagonistic)
    shard_count*timeout_multiplier coef=-1.1313  (antagonistic)

  Predicted optimum (from quadratic model, at observed points):
    shard_count = 16
    retry_flaky_count = 0
    timeout_multiplier = 1
    Predicted value: 12.0018

  Surface optimum (via L-BFGS-B, quadratic model):
    shard_count = 6.5921
    retry_flaky_count = 0
    timeout_multiplier = 3
    Predicted value: 2.6151

  Model quality: Moderate fit — use predictions directionally, not precisely.

Factor importance:
  1. shard_count  (effect: 8.9, contribution: 64.8%)
  2. timeout_multiplier  (effect: 2.8, contribution: 20.7%)
  3. retry_flaky_count  (effect: 2.0, contribution: 14.5%)
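The quadratic coefficients for total_wall_time_min can be checked by hand. The sketch below assumes each factor is coded to the range -1 to +1 (low level to high level). Under that assumption it reproduces both predicted values in the output above: 58.9777 at (2, 3, 1) and 16.9308 at (2, 0, 3). The function names are hypothetical.

```python
# Coefficients copied from the quadratic RSM model for total_wall_time_min.
B0 = 34.1449
LINEAR = (-0.1528, 0.0182, -6.5303)        # shard, retry, timeout
INTERACTION = {(0, 1): -8.4375, (0, 2): 6.0375, (1, 2): -3.5125}
QUADRATIC = (-2.3270, -0.1070, 2.5780)

# Low/high levels from the Factors table, in the same factor order.
RANGES = ((2.0, 16.0), (0.0, 3.0), (1.0, 3.0))


def to_coded(value, low, high):
    """Assumed coding: low -> -1, center -> 0, high -> +1."""
    return (value - (low + high) / 2) / ((high - low) / 2)


def predict_wall_time(shard_count, retry_flaky_count, timeout_multiplier):
    """Evaluate the quadratic model at a point given in natural units."""
    natural = (shard_count, retry_flaky_count, timeout_multiplier)
    x = [to_coded(v, low, high) for v, (low, high) in zip(natural, RANGES)]
    y = B0
    y += sum(b * xi for b, xi in zip(LINEAR, x))
    y += sum(b * x[i] * x[j] for (i, j), b in INTERACTION.items())
    y += sum(b * xi ** 2 for b, xi in zip(QUADRATIC, x))
    return y


print(round(predict_wall_time(2, 3, 1), 4))
print(round(predict_wall_time(2, 0, 3), 4))
```

The same check on the flaky_failure_rate coefficients at (16, 0, 1) gives about 12.002, matching the reported 12.0018 up to rounding of the printed coefficients.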