Summary
This experiment investigates test suite sharding. It uses a Central Composite Design to optimize shard count, retry count, and timeout multiplier for wall time and flaky failures.
The design varies 3 factors: shard count (shards), ranging from 2 to 16; retry flaky count (retries), ranging from 0 to 3; and timeout multiplier (x), ranging from 1.0 to 3.0. The goal is to optimize 2 responses: total wall time (min, minimize) and flaky failure rate (%, minimize). Conditions held constant across all runs are framework = pytest and test count = 4500.
A Central Composite Design (CCD) was selected to fit a full quadratic response surface model, including curvature and interaction effects. With 3 factors this produces 22 runs including center points and axial (star) points that extend beyond the factorial range.
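The run count can be checked by hand: a CCD with k factors has 2^k factorial corners and 2k axial points, plus replicated center points. The sketch below does that arithmetic. The figure of 8 center points is an assumption read off the experimental matrix in this document (the rows at 9 / 1.5 / 2), not something stated by the tool.

```python
def ccd_run_count(k: int, n_center: int) -> dict:
    """Break a full-factorial central composite design into its parts."""
    factorial = 2 ** k   # corners of the coded cube (+/-1 on every factor)
    axial = 2 * k        # two star points per factor, at +/-alpha
    return {
        "factorial": factorial,
        "axial": axial,
        "center": n_center,
        "total": factorial + axial + n_center,
    }


# n_center=8 is an assumption taken from the matrix in this document.
print(ccd_run_count(k=3, n_center=8))
# {'factorial': 8, 'axial': 6, 'center': 8, 'total': 22}
```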
Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
Key Findings
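For total wall time min, the factor contributions were nearly equal: timeout multiplier (35.3%), shard count (33.3%), and retry flaky count (31.4%). None was statistically significant in the ANOVA (all p > 0.74). The best observed value was 22.1. The optimizer output attributes it to shard count = -3.78019, retry flaky count = 1.5, timeout multiplier = 2, but the summary statistics list 22.1 as the minimum at shard count = 2, retry flaky count = 3, and timeout multiplier = 1, and show 25.5 for the single run at shard count = -3.78019, so the reported settings appear to be mislabeled.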
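For flaky failure rate, the most influential factors were retry flaky count (63.5%), shard count (20.4%), and timeout multiplier (16.1%); only retry flaky count was significant in the ANOVA (p = 0.0127). The best observed value was 2.01. The optimizer output attributes it to shard count = 16, retry flaky count = 3, timeout multiplier = 1, but the summary statistics list 2.01 as the minimum at shard count = 9, retry flaky count = -1.23861, and timeout multiplier = 2, which is an axial point with a negative retry count.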
Recommended Next Steps
- Run confirmation experiments at the predicted optimal settings to validate the model.
- Consider whether any fixed factors should be varied in a future study.
Experimental Setup
Factors
| Factor | Low | High | Unit |
| shard_count | 2 | 16 | shards |
| retry_flaky_count | 0 | 3 | retries |
| timeout_multiplier | 1.0 | 3.0 | x |
Fixed: framework = pytest, test_count = 4500
Responses
| Response | Direction | Unit |
| total_wall_time_min | ↓ minimize | min |
| flaky_failure_rate | ↓ minimize | % |
Configuration
{
  "metadata": {
    "name": "Test Suite Sharding",
    "description": "Central Composite design to optimize shard count, retry count, and timeout multiplier for wall time and flaky failures"
  },
  "factors": [
    {
      "name": "shard_count",
      "levels": [
        "2",
        "16"
      ],
      "type": "continuous",
      "unit": "shards"
    },
    {
      "name": "retry_flaky_count",
      "levels": [
        "0",
        "3"
      ],
      "type": "continuous",
      "unit": "retries"
    },
    {
      "name": "timeout_multiplier",
      "levels": [
        "1.0",
        "3.0"
      ],
      "type": "continuous",
      "unit": "x"
    }
  ],
  "fixed_factors": {
    "framework": "pytest",
    "test_count": "4500"
  },
  "responses": [
    {
      "name": "total_wall_time_min",
      "optimize": "minimize",
      "unit": "min"
    },
    {
      "name": "flaky_failure_rate",
      "optimize": "minimize",
      "unit": "%"
    }
  ],
  "settings": {
    "operation": "central_composite",
    "test_script": "use_cases/81_test_suite_sharding/sim.sh"
  }
}
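The configuration stores factor levels as strings, so any script that reads it needs to convert them before doing arithmetic. The sketch below parses a trimmed copy of the `factors` section and derives each factor's center and half-range. The helper name `factor_ranges` is hypothetical and is not part of the DOE Helper Tool.

```python
import json

# Trimmed copy of the "factors" section from the configuration above.
CONFIG_TEXT = """
{
  "factors": [
    {"name": "shard_count", "levels": ["2", "16"], "type": "continuous"},
    {"name": "retry_flaky_count", "levels": ["0", "3"], "type": "continuous"},
    {"name": "timeout_multiplier", "levels": ["1.0", "3.0"], "type": "continuous"}
  ]
}
"""


def factor_ranges(config: dict) -> dict:
    """Hypothetical helper: parse string levels into numeric low/high/center/half-range."""
    ranges = {}
    for factor in config["factors"]:
        low, high = (float(level) for level in factor["levels"])
        ranges[factor["name"]] = {
            "low": low,
            "high": high,
            "center": (low + high) / 2,
            "half_range": (high - low) / 2,
        }
    return ranges


ranges = factor_ranges(json.loads(CONFIG_TEXT))
for name, r in ranges.items():
    print(name, r)
```

The centers (9, 1.5, 2) are the settings of the center-point runs in the experimental matrix.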
Experimental Matrix
The Central Composite Design produces 22 runs. Each row is one experiment with specific factor settings.
| Run | shard_count | retry_flaky_count | timeout_multiplier |
| 1 | 9 | 1.5 | 2 |
| 2 | 16 | 0 | 3 |
| 3 | 2 | 3 | 1 |
| 4 | 9 | 4.23861 | 2 |
| 5 | 9 | 1.5 | 2 |
| 6 | -3.78019 | 1.5 | 2 |
| 7 | 9 | 1.5 | 0.174258 |
| 8 | 9 | 1.5 | 2 |
| 9 | 16 | 3 | 1 |
| 10 | 21.7802 | 1.5 | 2 |
| 11 | 9 | 1.5 | 2 |
| 12 | 9 | -1.23861 | 2 |
| 13 | 9 | 1.5 | 2 |
| 14 | 2 | 0 | 3 |
| 15 | 9 | 1.5 | 2 |
| 16 | 16 | 0 | 1 |
| 17 | 9 | 1.5 | 3.82574 |
| 18 | 16 | 3 | 3 |
| 19 | 9 | 1.5 | 2 |
| 20 | 2 | 0 | 1 |
| 21 | 2 | 3 | 3 |
| 22 | 9 | 1.5 | 2 |
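The non-round values in the matrix are the axial (star) points. They can be reproduced by placing each factor at its center plus or minus alpha half-ranges. In the sketch below, alpha = sqrt(10/3) ≈ 1.8257 is an assumption inferred from the matrix values themselves; the document does not state which alpha rule the tool applies.

```python
import math

# Low/high levels from the Factors table.
RANGES = {
    "shard_count": (2.0, 16.0),
    "retry_flaky_count": (0.0, 3.0),
    "timeout_multiplier": (1.0, 3.0),
}

# Assumption: alpha inferred from the matrix, not taken from the tool's docs.
ALPHA = math.sqrt(10 / 3)


def to_natural(coded: float, low: float, high: float) -> float:
    """Map a coded level (-1 = low, +1 = high) back to natural units."""
    center = (low + high) / 2
    half_range = (high - low) / 2
    return center + coded * half_range


def axial_points(ranges: dict, alpha: float) -> dict:
    """Return the (lower, upper) star point for every factor."""
    return {
        name: (to_natural(-alpha, lo, hi), to_natural(alpha, lo, hi))
        for name, (lo, hi) in ranges.items()
    }


for name, (lower, upper) in axial_points(RANGES, ALPHA).items():
    print(f"{name}: {lower:.5f} .. {upper:.5f}")
```

The lower star points for shard count and retry count come out negative, which is worth keeping in mind when reading the results for runs 6 and 12.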
Step-by-Step Workflow
1. Preview the design
$ doe info --config use_cases/81_test_suite_sharding/config.json
2. Generate the runner script
$ doe generate --config use_cases/81_test_suite_sharding/config.json \
    --output use_cases/81_test_suite_sharding/results/run.sh --seed 42
3. Execute the experiments
$ bash use_cases/81_test_suite_sharding/results/run.sh
4. Analyze results
$ doe analyze --config use_cases/81_test_suite_sharding/config.json
5. Get optimization recommendations
$ doe optimize --config use_cases/81_test_suite_sharding/config.json
6. Multi-objective optimization
With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.
$ doe optimize --config use_cases/81_test_suite_sharding/config.json --multi
7. Generate the HTML report
$ doe report --config use_cases/81_test_suite_sharding/config.json \
    --output use_cases/81_test_suite_sharding/results/report.html
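The seven steps above can be chained from one script. The sketch below only assembles the commands and flags exactly as they appear in this workflow, and defaults to a dry run that prints them. The function names `build_steps` and `run_all` are hypothetical, and actually executing the steps assumes the `doe` CLI is installed and on the PATH.

```python
import shlex
import subprocess

BASE = "use_cases/81_test_suite_sharding"
CONFIG = f"{BASE}/config.json"


def build_steps(config: str = CONFIG, base: str = BASE, seed: int = 42) -> list:
    """Return the workflow commands in order, using only flags shown above."""
    run_sh = f"{base}/results/run.sh"
    report = f"{base}/results/report.html"
    return [
        ["doe", "info", "--config", config],
        ["doe", "generate", "--config", config, "--output", run_sh, "--seed", str(seed)],
        ["bash", run_sh],
        ["doe", "analyze", "--config", config],
        ["doe", "optimize", "--config", config],
        ["doe", "optimize", "--config", config, "--multi"],
        ["doe", "report", "--config", config, "--output", report],
    ]


def run_all(steps: list, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


run_all(build_steps())  # dry run: prints the commands without executing them
```

Using `check=True` means a failed generate or analyze step stops the chain instead of producing a report from stale results.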
Features Exercised
| Feature | Value |
| Design type | central_composite |
| Factor types | continuous (all 3) |
| Arg style | double-dash |
| Responses | 2 (total_wall_time_min ↓, flaky_failure_rate ↓) |
| Total runs | 22 |
Analysis Results
Generated from actual experiment runs using the DOE Helper Tool.
Response: total_wall_time_min
Top factors: timeout_multiplier (35.3%), shard_count (33.3%), retry_flaky_count (31.4%).
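The effect sizes and contribution percentages can be reproduced from the per-level means in the summary statistics of the full analysis output. The reported effects match the range (maximum minus minimum) of the level means, and each contribution is that effect's share of the total. This is a reconstruction from the numbers in this document, not a description of the tool's internals.

```python
# Per-level means for total_wall_time_min, copied from the summary statistics.
LEVEL_MEANS = {
    "shard_count": [25.5000, 39.0000, 30.3750, 42.4000, 33.9917],
    "retry_flaky_count": [23.8000, 39.7250, 35.3750, 29.6500, 27.5000],
    "timeout_multiplier": [31.4000, 30.9750, 32.9583, 38.4000, 48.9000],
}


def main_effects(level_means: dict) -> dict:
    """Effect = spread between the highest and lowest level mean."""
    return {name: max(means) - min(means) for name, means in level_means.items()}


def contributions(effects: dict) -> dict:
    """Each effect as a percentage of the sum of all effects."""
    total = sum(effects.values())
    return {name: 100 * effect / total for name, effect in effects.items()}


effects = main_effects(LEVEL_MEANS)
shares = contributions(effects)
for name in sorted(effects, key=effects.get, reverse=True):
    print(f"{name:20s} effect={effects[name]:.4f}  contribution={shares[name]:.1f}%")
```

Because the extreme means come from axial levels with a single run each, an effect computed this way is sensitive to noise in those individual runs.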
ANOVA
| Source | DF | SS | MS | F | p-value |
| shard_count | 4 | 294.0965 | 73.5241 | 0.382 | 0.8160 |
| retry_flaky_count | 4 | 374.4932 | 93.6233 | 0.487 | 0.7457 |
| timeout_multiplier | 4 | 354.5565 | 88.6391 | 0.461 | 0.7630 |
| Lack of Fit | 2 | 278.1682 | 139.0841 | 0.723 | 0.5181 |
| Pure Error | 7 | 1346.0187 | 192.2884 | | |
| Error | 9 | 1624.1870 | 192.2884 | | |
| Total | 21 | 2647.3332 | 126.0635 | | |
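The MS and F columns follow from SS and DF. In this table each F value appears to be the source's mean square divided by the pure-error mean square (1346.0187 / 7 ≈ 192.2884). The sketch below recomputes them for total_wall_time_min under that assumption; the lack-of-fit F is compared against the value in the full analysis output.

```python
# SS and DF for total_wall_time_min, copied from the ANOVA table.
SOURCES = {
    "shard_count": (294.0965, 4),
    "retry_flaky_count": (374.4932, 4),
    "timeout_multiplier": (354.5565, 4),
    "Lack of Fit": (278.1682, 2),
}
PURE_ERROR_SS, PURE_ERROR_DF = 1346.0187, 7


def f_ratios(sources: dict, error_ss: float, error_df: int) -> dict:
    """F = (SS / DF of the source) / (SS / DF of the error term)."""
    ms_error = error_ss / error_df
    return {name: (ss / df) / ms_error for name, (ss, df) in sources.items()}


ratios = f_ratios(SOURCES, PURE_ERROR_SS, PURE_ERROR_DF)
for name, f_value in ratios.items():
    print(f"{name:20s} F = {f_value:.3f}")
```

Every F value is below 1, meaning each factor's mean square is smaller than the run-to-run noise estimated from the center points. Turning these ratios into p-values needs an F-distribution survival function (for example `scipy.stats.f.sf`), which is left out to keep the sketch dependency-free.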
Pareto Chart
Main Effects Plot
Normal Probability Plot of Effects
Half-Normal Plot of Effects
Model Diagnostics
Response: flaky_failure_rate
Top factors: retry_flaky_count (63.5%), shard_count (20.4%), timeout_multiplier (16.1%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| shard_count | 4 | 13.5928 | 3.3982 | 0.913 | 0.4963 |
| retry_flaky_count | 4 | 88.4883 | 22.1221 | 5.946 | 0.0127 |
| timeout_multiplier | 4 | 22.5503 | 5.6376 | 1.515 | 0.2771 |
| Lack of Fit | 2 | 23.0813 | 11.5407 | 3.102 | 0.1085 |
| Pure Error | 7 | 26.0422 | 3.7203 | | |
| Error | 9 | 49.1235 | 3.7203 | | |
| Total | 21 | 173.7550 | 8.2740 | | |
Pareto Chart
Main Effects Plot
Normal Probability Plot of Effects
Half-Normal Plot of Effects
Model Diagnostics
Response Surface Plots
3D surfaces fitted with quadratic RSM. Red dots are observed data points.
flaky failure rate: retry flaky count vs timeout multiplier
flaky failure rate: shard count vs retry flaky count
flaky failure rate: shard count vs timeout multiplier
total wall time min: retry flaky count vs timeout multiplier
total wall time min: shard count vs retry flaky count
total wall time min: shard count vs timeout multiplier
Multi-Objective Optimization
When responses compete, Derringer–Suich desirability finds the best compromise.
Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
Overall Desirability
D = 0.9410
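The overall value can be checked from the per-response desirabilities and weights reported in this section (0.9209 with weight 1.0 and 0.9545 with weight 1.5). The sketch below implements a weighted geometric mean, D = (∏ d_i^w_i)^(1/Σw_i); that this is exactly how the tool normalizes the weights is an assumption.

```python
import math


def overall_desirability(d: dict, w: dict) -> float:
    """Weighted geometric mean of per-response desirabilities in [0, 1]."""
    if any(value <= 0 for value in d.values()):
        return 0.0  # one fully undesirable response zeroes the whole score
    total_weight = sum(w[name] for name in d)
    log_sum = sum(w[name] * math.log(d[name]) for name in d)
    return math.exp(log_sum / total_weight)


desirability = {"total_wall_time_min": 0.9209, "flaky_failure_rate": 0.9545}
weights = {"total_wall_time_min": 1.0, "flaky_failure_rate": 1.5}

D = overall_desirability(desirability, weights)
print(f"D = {D:.4f}")
```

The result agrees with the reported D = 0.9410 to within the rounding of the four-decimal inputs. The geometric mean is used rather than an arithmetic one so that a response with zero desirability cannot be offset by a good score on the other response.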
Per-Response Desirability
| Response | Weight | Desirability | Predicted | Dir |
| total_wall_time_min | 1.0 | 0.9209 | 23.80 min | ↓ |
| flaky_failure_rate | 1.5 | 0.9545 | 2.01 % | ↓ |
Recommended Settings
| Factor | Value |
| shard_count | 9 shards |
| retry_flaky_count | 1.5 retries |
| timeout_multiplier | 2 x |
Source: from observed run #9
Trade-off Summary
Sacrifice = how much worse than single-objective best.
| Response | Predicted | Best Observed | Sacrifice |
| total_wall_time_min | 23.80 | 22.10 | +1.70 |
| flaky_failure_rate | 2.01 | 2.01 | +0.00 |
Top 3 Runs by Desirability
| Run | D | Factor Settings |
| #9 | 0.9410 | shard_count=9, retry_flaky_count=1.5, timeout_multiplier=2 |
| #7 | 0.8485 | shard_count=16, retry_flaky_count=3, timeout_multiplier=1 |
| #18 | 0.8462 | shard_count=16, retry_flaky_count=0, timeout_multiplier=3 |
Model Quality
| Response | R² | Type |
| total_wall_time_min | 0.1215 | linear |
| flaky_failure_rate | 0.0712 | linear |
Full Multi-Objective Output
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================
Overall desirability: D = 0.9410
Response Weight Desirability Predicted Direction
---------------------------------------------------------------------
total_wall_time_min 1.0 0.9209 23.80 min ↓
flaky_failure_rate 1.5 0.9545 2.01 % ↓
Recommended settings:
shard_count = 9 shards
retry_flaky_count = 1.5 retries
timeout_multiplier = 2 x
(from observed run #9)
Trade-off summary:
total_wall_time_min: 23.80 (best observed: 22.10, sacrifice: +1.70)
flaky_failure_rate: 2.01 (best observed: 2.01, sacrifice: +0.00)
Model quality:
total_wall_time_min: R² = 0.1215 (linear)
flaky_failure_rate: R² = 0.0712 (linear)
Top 3 observed runs by overall desirability:
1. Run #9 (D=0.9410): shard_count=9, retry_flaky_count=1.5, timeout_multiplier=2
2. Run #7 (D=0.8485): shard_count=16, retry_flaky_count=3, timeout_multiplier=1
3. Run #18 (D=0.8462): shard_count=16, retry_flaky_count=0, timeout_multiplier=3
Full Analysis Output
=== Main Effects: total_wall_time_min ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
timeout_multiplier 17.9250 2.3938 35.3%
shard_count 16.9000 2.3938 33.3%
retry_flaky_count 15.9250 2.3938 31.4%
=== ANOVA Table: total_wall_time_min ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
shard_count 4 294.0965 73.5241 0.382 0.8160
retry_flaky_count 4 374.4932 93.6233 0.487 0.7457
timeout_multiplier 4 354.5565 88.6391 0.461 0.7630
Lack of Fit 2 278.1682 139.0841 0.723 0.5181
Pure Error 7 1346.0187 192.2884
Error 9 1624.1870 192.2884
Total 21 2647.3332 126.0635
=== Summary Statistics: total_wall_time_min ===
shard_count:
Level N Mean Std Min Max
------------------------------------------------------------
-3.78019 1 25.5000 0.0000 25.5000 25.5000
16 4 39.0000 12.7161 27.6000 54.7000
2 4 30.3750 6.9462 22.1000 39.1000
21.7802 1 42.4000 0.0000 42.4000 42.4000
9 12 33.9917 12.5169 23.8000 68.1000
retry_flaky_count:
Level N Mean Std Min Max
------------------------------------------------------------
-1.23861 1 23.8000 0.0000 23.8000 23.8000
0 4 39.7250 11.9036 30.0000 54.7000
1.5 12 35.3750 12.4215 25.1000 68.1000
3 4 29.6500 7.0835 22.1000 39.1000
4.23861 1 27.5000 0.0000 27.5000 27.5000
timeout_multiplier:
Level N Mean Std Min Max
------------------------------------------------------------
0.174258 1 31.4000 0.0000 31.4000 31.4000
1 4 30.9750 9.2676 22.1000 43.9000
2 12 32.9583 12.1521 23.8000 68.1000
3 4 38.4000 11.7004 29.8000 54.7000
3.82574 1 48.9000 0.0000 48.9000 48.9000
=== Main Effects: flaky_failure_rate ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
retry_flaky_count 11.9400 0.6133 63.5%
shard_count 3.8425 0.6133 20.4%
timeout_multiplier 3.0225 0.6133 16.1%
=== ANOVA Table: flaky_failure_rate ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
shard_count 4 13.5928 3.3982 0.913 0.4963
retry_flaky_count 4 88.4883 22.1221 5.946 0.0127
timeout_multiplier 4 22.5503 5.6376 1.515 0.2771
Lack of Fit 2 23.0813 11.5407 3.102 0.1085
Pure Error 7 26.0422 3.7203
Error 9 49.1235 3.7203
Total 21 173.7550 8.2740
=== Summary Statistics: flaky_failure_rate ===
shard_count:
Level N Mean Std Min Max
------------------------------------------------------------
-3.78019 1 4.8700 0.0000 4.8700 4.8700
16 4 6.0550 3.5811 3.4300 11.3300
2 4 4.8175 1.8627 2.5000 7.0600
21.7802 1 8.6600 0.0000 8.6600 8.6600
9 12 6.0117 3.1806 2.0100 13.9500
retry_flaky_count:
Level N Mean Std Min Max
------------------------------------------------------------
-1.23861 1 2.0100 0.0000 2.0100 2.0100
0 4 6.3550 3.3245 4.3800 11.3300
1.5 12 5.8092 1.9087 3.5500 9.9400
3 4 4.5175 2.0028 2.5000 7.0600
4.23861 1 13.9500 0.0000 13.9500 13.9500
timeout_multiplier:
Level N Mean Std Min Max
------------------------------------------------------------
0.174258 1 5.2400 0.0000 5.2400 5.2400
1 4 6.6825 3.4381 3.4300 11.3300
2 12 6.3975 3.1864 2.0100 13.9500
3 4 4.1900 1.1628 2.5000 5.0800
3.82574 1 3.6600 0.0000 3.6600 3.6600
Optimization Recommendations
=== Optimization: total_wall_time_min ===
Direction: minimize
Best observed run: #16
shard_count = -3.78019
retry_flaky_count = 1.5
timeout_multiplier = 2
Value: 22.1
RSM Model (linear, R² = 0.2364, Adj R² = 0.1091):
Coefficients:
intercept +34.2409
shard_count -0.1528
retry_flaky_count +0.0182
timeout_multiplier -6.5303
RSM Model (quadratic, R² = 0.7003, Adj R² = 0.4755):
Coefficients:
intercept +34.1449
shard_count -0.1528
retry_flaky_count +0.0182
timeout_multiplier -6.5303
shard_count*retry_flaky_count -8.4375
shard_count*timeout_multiplier +6.0375
retry_flaky_count*timeout_multiplier -3.5125
shard_count^2 -2.3270
retry_flaky_count^2 -0.1070
timeout_multiplier^2 +2.5780
Curvature analysis:
timeout_multiplier coef=+2.5780 convex (has a minimum)
shard_count coef=-2.3270 concave (has a maximum)
retry_flaky_count coef=-0.1070 concave (has a maximum)
Notable interactions:
shard_count*retry_flaky_count coef=-8.4375 (antagonistic)
shard_count*timeout_multiplier coef=+6.0375 (synergistic)
retry_flaky_count*timeout_multiplier coef=-3.5125 (antagonistic)
Predicted optimum (from quadratic model, at observed points):
shard_count = 2
retry_flaky_count = 3
timeout_multiplier = 1
Predicted value: 58.9777
Surface optimum (via L-BFGS-B, quadratic model):
shard_count = 2
retry_flaky_count = 0
timeout_multiplier = 3
Predicted value: 16.9308
Model quality: Good fit — general trends are captured, some noise remains.
Factor importance:
1. timeout_multiplier (effect: 27.1, contribution: 55.6%)
2. shard_count (effect: 15.0, contribution: 30.8%)
3. retry_flaky_count (effect: 6.6, contribution: 13.6%)
=== Optimization: flaky_failure_rate ===
Direction: minimize
Best observed run: #9
shard_count = 16
retry_flaky_count = 3
timeout_multiplier = 1
Value: 2.01
RSM Model (linear, R² = 0.0325, Adj R² = -0.1288):
Coefficients:
intercept +5.8709
shard_count +0.4643
retry_flaky_count -0.2689
timeout_multiplier -0.3112
RSM Model (quadratic, R² = 0.6937, Adj R² = 0.4639):
Coefficients:
intercept +5.1553
shard_count +0.4643
retry_flaky_count -0.2689
timeout_multiplier -0.3112
shard_count*retry_flaky_count -1.6913
shard_count*timeout_multiplier -1.1313
retry_flaky_count*timeout_multiplier +1.9063
shard_count^2 +1.4888
retry_flaky_count^2 +0.0758
timeout_multiplier^2 -0.4912
Curvature analysis:
shard_count coef=+1.4888 convex (has a minimum)
timeout_multiplier coef=-0.4912 concave (has a maximum)
retry_flaky_count coef=+0.0758 negligible curvature
Notable interactions:
retry_flaky_count*timeout_multiplier coef=+1.9063 (synergistic)
shard_count*retry_flaky_count coef=-1.6913 (antagonistic)
shard_count*timeout_multiplier coef=-1.1313 (antagonistic)
Predicted optimum (from quadratic model, at observed points):
shard_count = 16
retry_flaky_count = 0
timeout_multiplier = 1
Predicted value: 12.0018
Surface optimum (via L-BFGS-B, quadratic model):
shard_count = 6.5921
retry_flaky_count = 0
timeout_multiplier = 3
Predicted value: 2.6151
Model quality: Moderate fit — use predictions directionally, not precisely.
Factor importance:
1. shard_count (effect: 8.9, contribution: 64.8%)
2. timeout_multiplier (effect: 2.8, contribution: 20.7%)
3. retry_flaky_count (effect: 2.0, contribution: 14.5%)
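The quadratic coefficients listed above can be used to predict a response at any setting. The sketch below evaluates the total_wall_time_min model, assuming the coefficients are expressed in coded units (-1 at a factor's low level, +1 at its high level). That assumption is inferred from the fact that it reproduces the predicted values printed in the optimizer output; the output itself does not state it.

```python
# Quadratic model for total_wall_time_min, copied from the optimizer output.
# Factor order: shard_count, retry_flaky_count, timeout_multiplier.
WALL_TIME_MODEL = {
    "intercept": 34.1449,
    "linear": (-0.1528, 0.0182, -6.5303),
    "interaction": {(0, 1): -8.4375, (0, 2): 6.0375, (1, 2): -3.5125},
    "quadratic": (-2.3270, -0.1070, 2.5780),
}
RANGES = [(2.0, 16.0), (0.0, 3.0), (1.0, 3.0)]  # low/high per factor


def to_coded(x: float, low: float, high: float) -> float:
    """Map natural units to coded units (assumed scaling: low -> -1, high -> +1)."""
    return (x - (low + high) / 2) / ((high - low) / 2)


def predict(model: dict, natural: tuple) -> float:
    """Evaluate intercept + linear + two-way interaction + squared terms."""
    z = [to_coded(x, lo, hi) for x, (lo, hi) in zip(natural, RANGES)]
    y = model["intercept"]
    y += sum(b * zi for b, zi in zip(model["linear"], z))
    y += sum(b * z[i] * z[j] for (i, j), b in model["interaction"].items())
    y += sum(b * zi * zi for b, zi in zip(model["quadratic"], z))
    return y


print(f"{predict(WALL_TIME_MODEL, (2, 0, 3)):.4f}")  # reported surface optimum
print(f"{predict(WALL_TIME_MODEL, (2, 3, 1)):.4f}")  # reported point at observed settings
```

Under this assumption the model gives about 16.93 at (2, 0, 3) and about 58.98 at (2, 3, 1), matching the two predicted values in the output. The second figure is far above the 22.1 that the summary statistics show at those settings, which is consistent with the modest adjusted R² of the fit.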