Central Composite Design

Test Suite Sharding

Central Composite design to optimize shard count, retry count, and timeout multiplier for wall time and flaky failures

Summary

This experiment uses a Central Composite design to tune test suite sharding: it varies shard count, retry count, and timeout multiplier to minimize wall time and flaky failures.

The design varies three factors: shard count (2 to 16 shards), retry flaky count (0 to 3 retries), and timeout multiplier (1.0x to 3.0x). It targets two responses, both minimized: total wall time (total_wall_time_min, in minutes) and flaky failure rate (%). Two conditions are held constant across all runs: framework = pytest and test count = 4500.

A Central Composite Design (CCD) was selected to fit a full quadratic response surface model, including curvature and interaction effects. With 3 factors this produces 22 runs including center points and axial (star) points that extend beyond the factorial range.
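The structure described above (factorial corners, axial points, and center replicates) can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the axial distance `ALPHA` is an assumption inferred from the experimental matrix as (21.7802 - 9) / 7, the center-point count is taken from the number of (9, 1.5, 2) rows, and the helper names `coded_ccd` and `decode` are hypothetical. The run order differs from the randomized matrix.

```python
from itertools import product

# Factor ranges from the Factors table: name -> (low, high).
RANGES = {
    "shard_count": (2.0, 16.0),
    "retry_flaky_count": (0.0, 3.0),
    "timeout_multiplier": (1.0, 3.0),
}
# Assumption: axial distance in coded units, inferred from the matrix
# as (21.7802 - 9) / 7.
ALPHA = 1.82574
# Assumption: number of center replicates, counted from the matrix rows.
N_CENTER = 8


def coded_ccd(k, alpha, n_center):
    """Return coded CCD points: 2^k corners, 2k axial points, n_center centers."""
    corners = [list(p) for p in product((-1.0, 1.0), repeat=k)]
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            axial.append(point)
    centers = [[0.0] * k for _ in range(n_center)]
    return corners + axial + centers


def decode(point):
    """Map a coded point to natural units: center + coded * half_range."""
    out = {}
    for coded, (name, (low, high)) in zip(point, RANGES.items()):
        center, half = (low + high) / 2, (high - low) / 2
        out[name] = center + coded * half
    return out


runs = [decode(p) for p in coded_ccd(len(RANGES), ALPHA, N_CENTER)]

# Axial points land outside the low/high range of the factor they vary.
outside = [
    r for r in runs
    if any(not (RANGES[n][0] <= v <= RANGES[n][1]) for n, v in r.items())
]
print(len(runs), "runs,", len(outside), "outside the factor ranges")
```

Under these assumptions the sketch yields 22 runs, six of which are axial points outside the stated ranges, including a negative shard count and a negative retry count that cannot be configured on a real test runner.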

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The response surface plots below visualize how pairs of factors jointly affect each response.

Key Findings

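For total wall time, the three factors contributed almost equally: timeout multiplier (35.3%), shard count (33.3%), and retry flaky count (31.4%). None is statistically significant in the ANOVA (all p > 0.74). The best observed value was 22.1 min. The per-level summary statistics place that run at shard count = 2, retry flaky count = 3, timeout multiplier = 1; the axial run at shard count = -3.78019 named in the optimizer output recorded 25.5 min.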

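For flaky failure rate, the most influential factor was retry flaky count (63.5%), followed by shard count (20.4%) and timeout multiplier (16.1%). Retry flaky count is the only significant term in the ANOVA (p = 0.0127). The best observed value was 2.01%. The per-level summary statistics place that run at shard count = 9, retry flaky count = -1.23861, timeout multiplier = 2, an axial point with a negative retry count, rather than at the corner (16, 3, 1) named in the optimizer output.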
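The percentage contributions quoted above can be reproduced from the main-effect sizes in the `doe analyze` output. The sketch below assumes that each contribution is the factor's share of the summed absolute effects; that rule is an inference that happens to match the reported numbers, not a documented formula, and `percent_contribution` is a hypothetical helper name.

```python
def percent_contribution(effects):
    """Share of each |effect| in the sum of |effects|, in percent."""
    total = sum(abs(e) for e in effects.values())
    return {name: 100.0 * abs(e) / total for name, e in effects.items()}


# Effect sizes copied from the "Main Effects" blocks of the analysis output.
wall_time_effects = {
    "timeout_multiplier": 17.9250,
    "shard_count": 16.9000,
    "retry_flaky_count": 15.9250,
}
flaky_effects = {
    "retry_flaky_count": 11.9400,
    "shard_count": 3.8425,
    "timeout_multiplier": 3.0225,
}

wall_time_pct = percent_contribution(wall_time_effects)
flaky_pct = percent_contribution(flaky_effects)

for label, pct in (("total_wall_time_min", wall_time_pct),
                   ("flaky_failure_rate", flaky_pct)):
    print(label, {name: round(value, 1) for name, value in pct.items()})
```

Because the contributions are relative shares, a factor can rank first without being statistically significant; the ANOVA p-values are the better guide to which effects are distinguishable from noise.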

Experimental Setup

Factors

| Factor | Low | High | Unit |
|---|---|---|---|
| shard_count | 2 | 16 | shards |
| retry_flaky_count | 0 | 3 | retries |
| timeout_multiplier | 1.0 | 3.0 | x |

Fixed: framework = pytest, test_count = 4500

Responses

| Response | Direction | Unit |
|---|---|---|
| total_wall_time_min | ↓ minimize | min |
| flaky_failure_rate | ↓ minimize | % |

Configuration

use_cases/81_test_suite_sharding/config.json
{
  "metadata": {
    "name": "Test Suite Sharding",
    "description": "Central Composite design to optimize shard count, retry count, and timeout multiplier for wall time and flaky failures"
  },
  "factors": [
    { "name": "shard_count", "levels": ["2", "16"], "type": "continuous", "unit": "shards" },
    { "name": "retry_flaky_count", "levels": ["0", "3"], "type": "continuous", "unit": "retries" },
    { "name": "timeout_multiplier", "levels": ["1.0", "3.0"], "type": "continuous", "unit": "x" }
  ],
  "fixed_factors": {
    "framework": "pytest",
    "test_count": "4500"
  },
  "responses": [
    { "name": "total_wall_time_min", "optimize": "minimize", "unit": "min" },
    { "name": "flaky_failure_rate", "optimize": "minimize", "unit": "%" }
  ],
  "settings": {
    "operation": "central_composite",
    "test_script": "use_cases/81_test_suite_sharding/sim.sh"
  }
}

Experimental Matrix

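The Central Composite Design produces 22 runs: 8 factorial corner points, 6 axial points, and 8 center-point replicates at (9, 1.5, 2). Each row is one experiment with specific factor settings. The axial rows fall outside the low/high range of the factor they vary, which is why runs 6 and 12 show a negative shard count and a negative retry count.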

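| Run | shard_count | retry_flaky_count | timeout_multiplier |
|---|---|---|---|
| 1 | 9 | 1.5 | 2 |
| 2 | 16 | 0 | 3 |
| 3 | 2 | 3 | 1 |
| 4 | 9 | 4.23861 | 2 |
| 5 | 9 | 1.5 | 2 |
| 6 | -3.78019 | 1.5 | 2 |
| 7 | 9 | 1.5 | 0.174258 |
| 8 | 9 | 1.5 | 2 |
| 9 | 16 | 3 | 1 |
| 10 | 21.7802 | 1.5 | 2 |
| 11 | 9 | 1.5 | 2 |
| 12 | 9 | -1.23861 | 2 |
| 13 | 9 | 1.5 | 2 |
| 14 | 2 | 0 | 3 |
| 15 | 9 | 1.5 | 2 |
| 16 | 16 | 0 | 1 |
| 17 | 9 | 1.5 | 3.82574 |
| 18 | 16 | 3 | 3 |
| 19 | 9 | 1.5 | 2 |
| 20 | 2 | 0 | 1 |
| 21 | 2 | 3 | 3 |
| 22 | 9 | 1.5 | 2 |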

Step-by-Step Workflow

1. Preview the design

$ doe info --config use_cases/81_test_suite_sharding/config.json

2. Generate the runner script

$ doe generate --config use_cases/81_test_suite_sharding/config.json \
    --output use_cases/81_test_suite_sharding/results/run.sh --seed 42

3. Execute the experiments

$ bash use_cases/81_test_suite_sharding/results/run.sh

4. Analyze results

$ doe analyze --config use_cases/81_test_suite_sharding/config.json

5. Get optimization recommendations

$ doe optimize --config use_cases/81_test_suite_sharding/config.json

6. Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

$ doe optimize --config use_cases/81_test_suite_sharding/config.json --multi

7. Generate the HTML report

$ doe report --config use_cases/81_test_suite_sharding/config.json \
    --output use_cases/81_test_suite_sharding/results/report.html
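The seven steps can also be assembled programmatically, which is convenient when the same workflow is repeated for several use cases. The sketch below only builds and prints the command lines; it uses the subcommands and flags shown in the steps above and nothing else. The function name `workflow_commands` is hypothetical, and the layout of a `results` directory next to `config.json` is taken from the paths in this use case.

```python
import shlex


def workflow_commands(use_case_dir, seed=42):
    """Return the workflow as a list of argv lists, in execution order."""
    config = f"{use_case_dir}/config.json"
    results = f"{use_case_dir}/results"
    return [
        ["doe", "info", "--config", config],
        ["doe", "generate", "--config", config,
         "--output", f"{results}/run.sh", "--seed", str(seed)],
        ["bash", f"{results}/run.sh"],
        ["doe", "analyze", "--config", config],
        ["doe", "optimize", "--config", config],
        ["doe", "optimize", "--config", config, "--multi"],
        ["doe", "report", "--config", config,
         "--output", f"{results}/report.html"],
    ]


commands = workflow_commands("use_cases/81_test_suite_sharding")
for argv in commands:
    # shlex.join quotes arguments safely for display or logging.
    print(shlex.join(argv))
```

To execute rather than print, each argv list could be passed to `subprocess.run(argv, check=True)` so that a failing step stops the sequence.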

Features Exercised

| Feature | Value |
|---|---|
| Design type | central_composite |
| Factor types | continuous (all 3) |
| Arg style | double-dash |
| Responses | 2 (total_wall_time_min ↓, flaky_failure_rate ↓) |
| Total runs | 22 |

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: total_wall_time_min

Top factors: timeout_multiplier (35.3%), shard_count (33.3%), retry_flaky_count (31.4%).

ANOVA

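| Source | DF | SS | MS | F | p-value |
|---|---|---|---|---|---|
| shard_count | 4 | 294.0965 | 73.5241 | 0.382 | 0.8160 |
| retry_flaky_count | 4 | 374.4932 | 93.6233 | 0.487 | 0.7457 |
| timeout_multiplier | 4 | 354.5565 | 88.6391 | 0.461 | 0.7630 |
| Lack of Fit | 2 | 278.1682 | 139.0841 | 0.723 | 0.5181 |
| Pure Error | 7 | 1346.0187 | 192.2884 | | |
| Error | 9 | 1624.1870 | 192.2884 | | |
| Total | 21 | 2647.3332 | 126.0635 | | |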
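The F column of this ANOVA table can be checked by hand. The sketch below assumes each F value is the term's mean square divided by the pure-error mean square (pure-error SS / DF = 1346.0187 / 7); that assumption is inferred from the numbers, which it reproduces, rather than taken from the tool's documentation. The lack-of-fit row uses the values listed in the full analysis output, and `f_ratios` is a hypothetical helper name.

```python
def f_ratios(mean_squares, error_ms):
    """Return F = MS_term / MS_error for every term."""
    return {term: ms / error_ms for term, ms in mean_squares.items()}


# Mean squares for total_wall_time_min, copied from the ANOVA output.
wall_time_ms = {
    "shard_count": 73.5241,
    "retry_flaky_count": 93.6233,
    "timeout_multiplier": 88.6391,
    "Lack of Fit": 139.0841,
}
# Assumption: the denominator is the pure-error mean square.
pure_error_ms = 1346.0187 / 7

wall_time_f = f_ratios(wall_time_ms, pure_error_ms)
for term, f_value in wall_time_f.items():
    print(f"{term:20s} F = {f_value:.3f}")
```

Every ratio is well below 1, meaning the variation attributed to each factor is smaller than the run-to-run variation among the replicated center points, which is consistent with the large p-values.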

Pareto Chart

Pareto chart for total_wall_time_min

Main Effects Plot

Main effects plot for total_wall_time_min

Normal Probability Plot of Effects

Normal probability plot for total_wall_time_min

Half-Normal Plot of Effects

Half-normal plot for total_wall_time_min

Model Diagnostics

Model diagnostics for total_wall_time_min

Response: flaky_failure_rate

Top factors: retry_flaky_count (63.5%), shard_count (20.4%), timeout_multiplier (16.1%).

ANOVA

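| Source | DF | SS | MS | F | p-value |
|---|---|---|---|---|---|
| shard_count | 4 | 13.5928 | 3.3982 | 0.913 | 0.4963 |
| retry_flaky_count | 4 | 88.4883 | 22.1221 | 5.946 | 0.0127 |
| timeout_multiplier | 4 | 22.5503 | 5.6376 | 1.515 | 0.2771 |
| Lack of Fit | 2 | 23.0813 | 11.5407 | 3.102 | 0.1085 |
| Pure Error | 7 | 26.0422 | 3.7203 | | |
| Error | 9 | 49.1235 | 3.7203 | | |
| Total | 21 | 173.7550 | 8.2740 | | |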

Pareto Chart

Pareto chart for flaky_failure_rate

Main Effects Plot

Main effects plot for flaky_failure_rate

Normal Probability Plot of Effects

Normal probability plot for flaky_failure_rate

Half-Normal Plot of Effects

Half-normal plot for flaky_failure_rate

Model Diagnostics

Model diagnostics for flaky_failure_rate

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.
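A fitted quadratic surface can be evaluated at any setting once the coefficients are known. The sketch below uses the quadratic coefficients for total_wall_time_min from the `doe optimize` output and assumes they apply to coded factors, where -1 and +1 correspond to each factor's low and high level. That coding is an inference; it is supported by the fact that the sketch reproduces the two predicted values printed in that output. The names `to_coded` and `predict` are hypothetical.

```python
from itertools import combinations

FACTORS = ("shard_count", "retry_flaky_count", "timeout_multiplier")
RANGES = {
    "shard_count": (2.0, 16.0),
    "retry_flaky_count": (0.0, 3.0),
    "timeout_multiplier": (1.0, 3.0),
}

# Quadratic model for total_wall_time_min, copied from the optimizer output.
WALL_TIME_MODEL = {
    "intercept": 34.1449,
    "linear": {
        "shard_count": -0.1528,
        "retry_flaky_count": 0.0182,
        "timeout_multiplier": -6.5303,
    },
    "interaction": {
        ("shard_count", "retry_flaky_count"): -8.4375,
        ("shard_count", "timeout_multiplier"): 6.0375,
        ("retry_flaky_count", "timeout_multiplier"): -3.5125,
    },
    "quadratic": {
        "shard_count": -2.3270,
        "retry_flaky_count": -0.1070,
        "timeout_multiplier": 2.5780,
    },
}


def to_coded(settings):
    """Convert natural units to coded units (assumed: low = -1, high = +1)."""
    coded = {}
    for name in FACTORS:
        low, high = RANGES[name]
        coded[name] = (settings[name] - (low + high) / 2) / ((high - low) / 2)
    return coded


def predict(model, settings):
    """Evaluate intercept + linear + two-factor interaction + squared terms."""
    x = to_coded(settings)
    y = model["intercept"]
    y += sum(model["linear"][n] * x[n] for n in FACTORS)
    y += sum(model["interaction"][(a, b)] * x[a] * x[b]
             for a, b in combinations(FACTORS, 2))
    y += sum(model["quadratic"][n] * x[n] ** 2 for n in FACTORS)
    return y


surface_optimum = predict(
    WALL_TIME_MODEL,
    {"shard_count": 2, "retry_flaky_count": 0, "timeout_multiplier": 3},
)
corner_2_3_1 = predict(
    WALL_TIME_MODEL,
    {"shard_count": 2, "retry_flaky_count": 3, "timeout_multiplier": 1},
)
print(round(surface_optimum, 2), round(corner_2_3_1, 2))
```

The first value matches the reported surface optimum (16.9308) to within coefficient rounding, and the second matches the value the output lists for the corner (2, 3, 1), 58.9777. With an adjusted R² of about 0.48, these predictions are better read as directional than as precise estimates.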

RSM surface: flaky failure rate, retry flaky count vs timeout multiplier

RSM surface: flaky failure rate, shard count vs retry flaky count

RSM surface: flaky failure rate, shard count vs timeout multiplier

RSM surface: total wall time min, retry flaky count vs timeout multiplier

RSM surface: total wall time min, shard count vs retry flaky count

RSM surface: total wall time min, shard count vs timeout multiplier

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
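The combining step, a weighted geometric mean of the per-response desirabilities, is short enough to verify by hand. The sketch below takes the per-response desirabilities and weights reported in this section as given; how the tool scales each response to the 0–1 range is not shown here and is not reproduced. The function name `overall_desirability` is hypothetical.

```python
def overall_desirability(desirabilities, weights):
    """Weighted geometric mean: (prod d_i ** w_i) ** (1 / sum w_i)."""
    total_weight = sum(weights.values())
    product = 1.0
    for name, d in desirabilities.items():
        product *= d ** weights[name]
    return product ** (1.0 / total_weight)


# Values copied from the Per-Response Desirability table.
d = {"total_wall_time_min": 0.9209, "flaky_failure_rate": 0.9545}
w = {"total_wall_time_min": 1.0, "flaky_failure_rate": 1.5}

D = overall_desirability(d, w)
print(f"D = {D:.4f}")
```

The result agrees with the reported D = 0.9410 to within the rounding of the inputs. Because the mean is geometric, any single response with a desirability of 0 forces the overall score to 0, so no response can be sacrificed entirely.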

Overall Desirability
D = 0.9410

Per-Response Desirability

| Response | Weight | Desirability | Predicted | Dir |
|---|---|---|---|---|
| total_wall_time_min | 1.0 | 0.9209 | 23.80 min | ↓ |
| flaky_failure_rate | 1.5 | 0.9545 | 2.01 % | ↓ |

Recommended Settings

| Factor | Value |
|---|---|
| shard_count | 9 shards |
| retry_flaky_count | 1.5 retries |
| timeout_multiplier | 2 x |

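Source: from observed run #9. The responses reported for this run (23.80 min and 2.01 %) match the axial run at retry_flaky_count = -1.23861 in the summary statistics, not a center-point run at 1.5, so verify the retry setting before adopting it.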

Trade-off Summary

Sacrifice = how much worse than single-objective best.

| Response | Predicted | Best Observed | Sacrifice |
|---|---|---|---|
| total_wall_time_min | 23.80 | 22.10 | +1.70 |
| flaky_failure_rate | 2.01 | 2.01 | +0.00 |

Top 3 Runs by Desirability

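| Run | D | Factor Settings |
|---|---|---|
| #9 | 0.9410 | shard_count=9, retry_flaky_count=1.5, timeout_multiplier=2 |
| #7 | 0.8485 | shard_count=16, retry_flaky_count=3, timeout_multiplier=1 |
| #18 | 0.8462 | shard_count=16, retry_flaky_count=0, timeout_multiplier=3 |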

Model Quality

| Response | R² | Type |
|---|---|---|
| total_wall_time_min | 0.1215 | linear |
| flaky_failure_rate | 0.0712 | linear |

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9410

Response               Weight   Desirability   Predicted    Direction
---------------------------------------------------------------------
total_wall_time_min    1.0      0.9209         23.80 min    ↓
flaky_failure_rate     1.5      0.9545         2.01 %       ↓

Recommended settings:
  shard_count = 9 shards
  retry_flaky_count = 1.5 retries
  timeout_multiplier = 2 x
  (from observed run #9)

Trade-off summary:
  total_wall_time_min: 23.80 (best observed: 22.10, sacrifice: +1.70)
  flaky_failure_rate: 2.01 (best observed: 2.01, sacrifice: +0.00)

Model quality:
  total_wall_time_min: R² = 0.1215 (linear)
  flaky_failure_rate: R² = 0.0712 (linear)

Top 3 observed runs by overall desirability:
  1. Run #9 (D=0.9410): shard_count=9, retry_flaky_count=1.5, timeout_multiplier=2
  2. Run #7 (D=0.8485): shard_count=16, retry_flaky_count=3, timeout_multiplier=1
  3. Run #18 (D=0.8462): shard_count=16, retry_flaky_count=0, timeout_multiplier=3

Full Analysis Output

doe analyze
=== Main Effects: total_wall_time_min ===
Factor               Effect     Std Error   % Contribution
--------------------------------------------------------------
timeout_multiplier   17.9250    2.3938      35.3%
shard_count          16.9000    2.3938      33.3%
retry_flaky_count    15.9250    2.3938      31.4%

=== ANOVA Table: total_wall_time_min ===
Source               DF   SS          MS         F       p-value
-----------------------------------------------------------------------------
shard_count          4    294.0965    73.5241    0.382   0.8160
retry_flaky_count    4    374.4932    93.6233    0.487   0.7457
timeout_multiplier   4    354.5565    88.6391    0.461   0.7630
Lack of Fit          2    278.1682    139.0841   0.723   0.5181
Pure Error           7    1346.0187   192.2884
Error                9    1624.1870   192.2884
Total                21   2647.3332   126.0635

=== Summary Statistics: total_wall_time_min ===
shard_count:
Level      N    Mean      Std       Min       Max
------------------------------------------------------------
-3.78019   1    25.5000   0.0000    25.5000   25.5000
16         4    39.0000   12.7161   27.6000   54.7000
2          4    30.3750   6.9462    22.1000   39.1000
21.7802    1    42.4000   0.0000    42.4000   42.4000
9          12   33.9917   12.5169   23.8000   68.1000

retry_flaky_count:
Level      N    Mean      Std       Min       Max
------------------------------------------------------------
-1.23861   1    23.8000   0.0000    23.8000   23.8000
0          4    39.7250   11.9036   30.0000   54.7000
1.5        12   35.3750   12.4215   25.1000   68.1000
3          4    29.6500   7.0835    22.1000   39.1000
4.23861    1    27.5000   0.0000    27.5000   27.5000

timeout_multiplier:
Level      N    Mean      Std       Min       Max
------------------------------------------------------------
0.174258   1    31.4000   0.0000    31.4000   31.4000
1          4    30.9750   9.2676    22.1000   43.9000
2          12   32.9583   12.1521   23.8000   68.1000
3          4    38.4000   11.7004   29.8000   54.7000
3.82574    1    48.9000   0.0000    48.9000   48.9000

=== Main Effects: flaky_failure_rate ===
Factor               Effect     Std Error   % Contribution
--------------------------------------------------------------
retry_flaky_count    11.9400    0.6133      63.5%
shard_count          3.8425     0.6133      20.4%
timeout_multiplier   3.0225     0.6133      16.1%

=== ANOVA Table: flaky_failure_rate ===
Source               DF   SS         MS        F       p-value
-----------------------------------------------------------------------------
shard_count          4    13.5928    3.3982    0.913   0.4963
retry_flaky_count    4    88.4883    22.1221   5.946   0.0127
timeout_multiplier   4    22.5503    5.6376    1.515   0.2771
Lack of Fit          2    23.0813    11.5407   3.102   0.1085
Pure Error           7    26.0422    3.7203
Error                9    49.1235    3.7203
Total                21   173.7550   8.2740

=== Summary Statistics: flaky_failure_rate ===
shard_count:
Level      N    Mean     Std      Min      Max
------------------------------------------------------------
-3.78019   1    4.8700   0.0000   4.8700   4.8700
16         4    6.0550   3.5811   3.4300   11.3300
2          4    4.8175   1.8627   2.5000   7.0600
21.7802    1    8.6600   0.0000   8.6600   8.6600
9          12   6.0117   3.1806   2.0100   13.9500

retry_flaky_count:
Level      N    Mean      Std      Min       Max
------------------------------------------------------------
-1.23861   1    2.0100    0.0000   2.0100    2.0100
0          4    6.3550    3.3245   4.3800    11.3300
1.5        12   5.8092    1.9087   3.5500    9.9400
3          4    4.5175    2.0028   2.5000    7.0600
4.23861    1    13.9500   0.0000   13.9500   13.9500

timeout_multiplier:
Level      N    Mean     Std      Min      Max
------------------------------------------------------------
0.174258   1    5.2400   0.0000   5.2400   5.2400
1          4    6.6825   3.4381   3.4300   11.3300
2          12   6.3975   3.1864   2.0100   13.9500
3          4    4.1900   1.1628   2.5000   5.0800
3.82574    1    3.6600   0.0000   3.6600   3.6600

Optimization Recommendations

doe optimize
=== Optimization: total_wall_time_min ===
Direction: minimize
Best observed run: #16
  shard_count = -3.78019
  retry_flaky_count = 1.5
  timeout_multiplier = 2
  Value: 22.1

RSM Model (linear, R² = 0.2364, Adj R² = 0.1091):
Coefficients:
  intercept                              +34.2409
  shard_count                            -0.1528
  retry_flaky_count                      +0.0182
  timeout_multiplier                     -6.5303

RSM Model (quadratic, R² = 0.7003, Adj R² = 0.4755):
Coefficients:
  intercept                              +34.1449
  shard_count                            -0.1528
  retry_flaky_count                      +0.0182
  timeout_multiplier                     -6.5303
  shard_count*retry_flaky_count          -8.4375
  shard_count*timeout_multiplier         +6.0375
  retry_flaky_count*timeout_multiplier   -3.5125
  shard_count^2                          -2.3270
  retry_flaky_count^2                    -0.1070
  timeout_multiplier^2                   +2.5780

Curvature analysis:
  timeout_multiplier   coef=+2.5780   convex (has a minimum)
  shard_count          coef=-2.3270   concave (has a maximum)
  retry_flaky_count    coef=-0.1070   concave (has a maximum)

Notable interactions:
  shard_count*retry_flaky_count          coef=-8.4375 (antagonistic)
  shard_count*timeout_multiplier         coef=+6.0375 (synergistic)
  retry_flaky_count*timeout_multiplier   coef=-3.5125 (antagonistic)

Predicted optimum (from quadratic model, at observed points):
  shard_count = 2
  retry_flaky_count = 3
  timeout_multiplier = 1
  Predicted value: 58.9777

Surface optimum (via L-BFGS-B, quadratic model):
  shard_count = 2
  retry_flaky_count = 0
  timeout_multiplier = 3
  Predicted value: 16.9308

Model quality: Good fit — general trends are captured, some noise remains.

Factor importance:
  1. timeout_multiplier (effect: 27.1, contribution: 55.6%)
  2. shard_count (effect: 15.0, contribution: 30.8%)
  3. retry_flaky_count (effect: 6.6, contribution: 13.6%)

=== Optimization: flaky_failure_rate ===
Direction: minimize
Best observed run: #9
  shard_count = 16
  retry_flaky_count = 3
  timeout_multiplier = 1
  Value: 2.01

RSM Model (linear, R² = 0.0325, Adj R² = -0.1288):
Coefficients:
  intercept                              +5.8709
  shard_count                            +0.4643
  retry_flaky_count                      -0.2689
  timeout_multiplier                     -0.3112

RSM Model (quadratic, R² = 0.6937, Adj R² = 0.4639):
Coefficients:
  intercept                              +5.1553
  shard_count                            +0.4643
  retry_flaky_count                      -0.2689
  timeout_multiplier                     -0.3112
  shard_count*retry_flaky_count          -1.6913
  shard_count*timeout_multiplier         -1.1313
  retry_flaky_count*timeout_multiplier   +1.9063
  shard_count^2                          +1.4888
  retry_flaky_count^2                    +0.0758
  timeout_multiplier^2                   -0.4912

Curvature analysis:
  shard_count          coef=+1.4888   convex (has a minimum)
  timeout_multiplier   coef=-0.4912   concave (has a maximum)
  retry_flaky_count    coef=+0.0758   negligible curvature

Notable interactions:
  retry_flaky_count*timeout_multiplier   coef=+1.9063 (synergistic)
  shard_count*retry_flaky_count          coef=-1.6913 (antagonistic)
  shard_count*timeout_multiplier         coef=-1.1313 (antagonistic)

Predicted optimum (from quadratic model, at observed points):
  shard_count = 16
  retry_flaky_count = 0
  timeout_multiplier = 1
  Predicted value: 12.0018

Surface optimum (via L-BFGS-B, quadratic model):
  shard_count = 6.5921
  retry_flaky_count = 0
  timeout_multiplier = 3
  Predicted value: 2.6151

Model quality: Moderate fit — use predictions directionally, not precisely.

Factor importance:
  1. shard_count (effect: 8.9, contribution: 64.8%)
  2. timeout_multiplier (effect: 2.8, contribution: 20.7%)
  3. retry_flaky_count (effect: 2.0, contribution: 14.5%)