Summary
This experiment investigates feature flag evaluation. It uses a Box-Behnken design to tune cache TTL, rule complexity, and SDK polling interval, with evaluation latency and cache hit rate as the responses.
The design varies 3 factors: cache_ttl_sec (5 to 120 sec), rule_complexity_score (1 to 10), and sdk_polling_interval_sec (10 to 300 sec). The goal is to optimize 2 responses: evaluation_latency_us (us, minimize) and cache_hit_rate_pct (%, maximize). Two conditions are held constant across all runs: platform = launchdarkly and sdk = server_side.
A Box-Behnken design was chosen because it efficiently fits quadratic models with 3 continuous factors while avoiding extreme corner combinations. It requires only 15 runs (12 edge midpoints plus 3 center points) instead of the 27 needed for a full factorial at three levels.
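The run count can be checked with a minimal sketch of the standard three-factor Box-Behnken construction. The helper names `box_behnken` and `decode` are illustrative and are not part of the DOE tool; the three center points are taken from the experimental matrix in this report, and the run order here is not randomized.

```python
from itertools import combinations, product

# Factor ranges copied from this report.
RANGES = {
    "cache_ttl_sec": (5, 120),
    "rule_complexity_score": (1, 10),
    "sdk_polling_interval_sec": (10, 300),
}

def box_behnken(n_factors, n_center=3):
    """Coded Box-Behnken runs: a 2x2 factorial at +/-1 for every factor pair,
    with the remaining factors held at 0, plus n_center center points."""
    runs = []
    for i, j in combinations(range(n_factors), 2):
        for a, b in product((-1, 1), repeat=2):
            row = [0] * n_factors
            row[i], row[j] = a, b
            runs.append(tuple(row))
    runs.extend([(0,) * n_factors] * n_center)
    return runs

def decode(coded, low, high):
    """Map a coded level (-1, 0, +1) to real units."""
    return (low + high) / 2 + coded * (high - low) / 2

design = box_behnken(len(RANGES))
full_factorial = list(product((-1, 0, 1), repeat=len(RANGES)))
real_runs = [
    tuple(decode(c, *RANGES[name]) for c, name in zip(row, RANGES))
    for row in design
]

print(len(design), "Box-Behnken runs vs", len(full_factorial), "full-factorial runs")
```

Every Box-Behnken row keeps at least one factor at its middle level, which is why no corner combination (all three factors at an extreme) appears in the design.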
Quadratic response surface models were fitted to capture potential curvature and factor interactions. The response surface plots below visualize how pairs of factors jointly affect each response.
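As a rough illustration of what a full quadratic model contains, the sketch below expands one coded run into its model columns: an intercept, the linear terms, the two-factor interactions, and the squared terms. The function name `quadratic_terms` is hypothetical, and the term ordering simply mirrors the coefficient listing in the optimization output of this report.

```python
from itertools import combinations

def quadratic_terms(x):
    """Expand coded factor settings into full quadratic model columns:
    intercept, linear, two-factor interaction, and squared terms."""
    linear = list(x)
    interactions = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    squares = [v * v for v in x]
    return [1.0] + linear + interactions + squares

# One edge-midpoint run in coded units: cache TTL high, complexity mid, polling low.
row = quadratic_terms((1, 0, -1))
print(len(row), "model terms")
```

With 3 factors this gives 10 coefficients, so a 15-run design leaves 5 degrees of freedom for estimating error.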
Key Findings
For evaluation_latency_us, the most influential factors were sdk_polling_interval_sec (48.8%), cache_ttl_sec (32.2%), and rule_complexity_score (19.0%). The best observed value was 115.0 us (reported at cache_ttl_sec = 62.5, rule_complexity_score = 5.5, sdk_polling_interval_sec = 155).
For cache_hit_rate_pct, the most influential factors were sdk_polling_interval_sec (58.1%), rule_complexity_score (30.1%), and cache_ttl_sec (11.8%). The best observed value was 90.6% (reported at cache_ttl_sec = 62.5, rule_complexity_score = 5.5, sdk_polling_interval_sec = 155).
In the ANOVA tables, only the effect of sdk_polling_interval_sec on cache_hit_rate_pct is significant at the 5% level (p = 0.0106); no factor reaches significance for evaluation_latency_us.
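The contribution percentages for evaluation_latency_us are consistent with a simple range-of-means calculation: each factor's effect is the spread between its highest and lowest level mean, and the contribution is that effect divided by the sum of all effects. The sketch below reproduces this from the level means listed in the Summary Statistics output of this report. It is a reading of the reported numbers, not a description of the tool's internals, and the function name `contributions` is illustrative.

```python
# Level means for evaluation_latency_us, copied from the Summary Statistics output.
LEVEL_MEANS = {
    "cache_ttl_sec": [198.5000, 177.4286, 149.0000],
    "rule_complexity_score": [158.5000, 187.7143, 171.0000],
    "sdk_polling_interval_sec": [171.2500, 149.7143, 224.7500],
}

def contributions(level_means):
    """Return {factor: (effect, percent contribution)}, where the effect is
    the range of the level means and percentages are normalized to 100."""
    effects = {name: max(means) - min(means) for name, means in level_means.items()}
    total = sum(effects.values())
    return {name: (eff, 100 * eff / total) for name, eff in effects.items()}

result = contributions(LEVEL_MEANS)
for name, (effect, pct) in sorted(result.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:28s} effect={effect:8.4f} contribution={pct:5.1f}%")
```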
Recommended Next Steps
- Run confirmation experiments at the predicted optimal settings to validate the model.
- Consider whether any fixed factors should be varied in a future study.
Experimental Setup
Factors
| Factor | Low | High | Unit |
| cache_ttl_sec | 5 | 120 | sec |
| rule_complexity_score | 1 | 10 | score |
| sdk_polling_interval_sec | 10 | 300 | sec |
Fixed: platform = launchdarkly, sdk = server_side
Responses
| Response | Direction | Unit |
| evaluation_latency_us | ↓ minimize | us |
| cache_hit_rate_pct | ↑ maximize | % |
Configuration
{
  "metadata": {
    "name": "Feature Flag Evaluation",
    "description": "Box-Behnken design to tune cache TTL, rule complexity, and SDK polling interval for evaluation latency and cache hits"
  },
  "factors": [
    {
      "name": "cache_ttl_sec",
      "levels": [
        "5",
        "120"
      ],
      "type": "continuous",
      "unit": "sec"
    },
    {
      "name": "rule_complexity_score",
      "levels": [
        "1",
        "10"
      ],
      "type": "continuous",
      "unit": "score"
    },
    {
      "name": "sdk_polling_interval_sec",
      "levels": [
        "10",
        "300"
      ],
      "type": "continuous",
      "unit": "sec"
    }
  ],
  "fixed_factors": {
    "platform": "launchdarkly",
    "sdk": "server_side"
  },
  "responses": [
    {
      "name": "evaluation_latency_us",
      "optimize": "minimize",
      "unit": "us"
    },
    {
      "name": "cache_hit_rate_pct",
      "optimize": "maximize",
      "unit": "%"
    }
  ],
  "settings": {
    "operation": "box_behnken",
    "test_script": "use_cases/84_feature_flag_evaluation/sim.sh"
  }
}
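Note that the factor levels in the configuration are stored as strings. A minimal sketch of reading them, assuming only the `factors` structure shown above (the embedded JSON is a trimmed copy, and `factor_ranges` and `center_levels` are hypothetical helper names), converts them to numbers and derives the middle level used by the Box-Behnken design:

```python
import json

# Trimmed copy of the "factors" section of the configuration above.
CONFIG_TEXT = """
{
  "factors": [
    {"name": "cache_ttl_sec", "levels": ["5", "120"], "type": "continuous", "unit": "sec"},
    {"name": "rule_complexity_score", "levels": ["1", "10"], "type": "continuous", "unit": "score"},
    {"name": "sdk_polling_interval_sec", "levels": ["10", "300"], "type": "continuous", "unit": "sec"}
  ]
}
"""

def factor_ranges(config):
    """Return {factor name: (low, high)} with the string levels parsed as floats."""
    return {
        f["name"]: (float(f["levels"][0]), float(f["levels"][1]))
        for f in config["factors"]
    }

def center_levels(ranges):
    """Middle level of each factor, i.e. the coded 0 level."""
    return {name: (low + high) / 2 for name, (low, high) in ranges.items()}

ranges = factor_ranges(json.loads(CONFIG_TEXT))
centers = center_levels(ranges)
print(centers)
```

The resulting middle levels match the center-point rows of the experimental matrix.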
Experimental Matrix
The Box-Behnken Design produces 15 runs. Each row is one experiment with specific factor settings.
| Run | cache_ttl_sec | rule_complexity_score | sdk_polling_interval_sec |
| 1 | 62.5 | 1 | 10 |
| 2 | 62.5 | 5.5 | 155 |
| 3 | 120 | 5.5 | 300 |
| 4 | 120 | 5.5 | 10 |
| 5 | 62.5 | 5.5 | 155 |
| 6 | 62.5 | 5.5 | 155 |
| 7 | 5 | 5.5 | 300 |
| 8 | 120 | 1 | 155 |
| 9 | 62.5 | 1 | 300 |
| 10 | 120 | 10 | 155 |
| 11 | 5 | 5.5 | 10 |
| 12 | 62.5 | 10 | 300 |
| 13 | 5 | 1 | 155 |
| 14 | 5 | 10 | 155 |
| 15 | 62.5 | 10 | 10 |
Step-by-Step Workflow
1. Preview the design
$ doe info --config use_cases/84_feature_flag_evaluation/config.json
2. Generate the runner script
$ doe generate --config use_cases/84_feature_flag_evaluation/config.json \
    --output use_cases/84_feature_flag_evaluation/results/run.sh --seed 42
3. Execute the experiments
$ bash use_cases/84_feature_flag_evaluation/results/run.sh
4. Analyze results
$ doe analyze --config use_cases/84_feature_flag_evaluation/config.json
5. Get optimization recommendations
$ doe optimize --config use_cases/84_feature_flag_evaluation/config.json
6. Multi-objective optimization
With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.
$ doe optimize --config use_cases/84_feature_flag_evaluation/config.json --multi
7. Generate the HTML report
$ doe report --config use_cases/84_feature_flag_evaluation/config.json \
    --output use_cases/84_feature_flag_evaluation/results/report.html
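For scripted use, the seven steps above could be assembled as argument lists, as in the sketch below. It only builds and prints the commands, using the subcommands, flags, and paths shown in this workflow; `workflow_commands` is a hypothetical helper, and actually running the commands (for example with `subprocess.run(cmd, check=True)`) assumes the `doe` CLI is installed and on the PATH.

```python
import shlex

BASE = "use_cases/84_feature_flag_evaluation"

def workflow_commands(base=BASE, seed=42):
    """Return the workflow steps as argument lists, in execution order."""
    config = f"{base}/config.json"
    results = f"{base}/results"
    return [
        ["doe", "info", "--config", config],
        ["doe", "generate", "--config", config,
         "--output", f"{results}/run.sh", "--seed", str(seed)],
        ["bash", f"{results}/run.sh"],
        ["doe", "analyze", "--config", config],
        ["doe", "optimize", "--config", config],
        ["doe", "optimize", "--config", config, "--multi"],
        ["doe", "report", "--config", config,
         "--output", f"{results}/report.html"],
    ]

commands = workflow_commands()
for cmd in commands:
    # shlex.join quotes arguments safely for display in a shell.
    print(shlex.join(cmd))
```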
Features Exercised
| Feature | Value |
| Design type | box_behnken |
| Factor types | continuous (all 3) |
| Arg style | double-dash |
| Responses | 2 (evaluation_latency_us ↓, cache_hit_rate_pct ↑) |
| Total runs | 15 |
Analysis Results
Generated from actual experiment runs using the DOE Helper Tool.
Response: evaluation_latency_us
Top factors: sdk_polling_interval_sec (48.8%), cache_ttl_sec (32.2%), rule_complexity_score (19.0%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| cache_ttl_sec | 2 | 4951.0190 | 2475.5095 | 1.001 | 0.4093 |
| rule_complexity_score | 2 | 2281.3048 | 1140.6524 | 0.461 | 0.6463 |
| sdk_polling_interval_sec | 2 | 14428.8048 | 7214.4024 | 2.917 | 0.1118 |
| Lack of Fit | 6 | 11376.6048 | 1896.1008 | 0.767 | 0.6614 |
| Pure Error | 2 | 4946.0000 | 2473.0000 | | |
| Error | 8 | 16322.6048 | 2473.0000 | | |
| Total | 14 | 37983.7333 | 2713.1238 | | |
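The F and p-value columns for the three factor rows can be reproduced from the table itself. The reported F values match each factor's MS divided by the error MS of 2473.0000, and the reported p-values match the upper tail of an F distribution with 2 and 8 degrees of freedom, which has the closed form (1 + 2F/8)^(-4) when the numerator has 2 degrees of freedom. The sketch below is a check on the reported numbers rather than a description of how the tool computes them; both function names are illustrative.

```python
# Mean squares for evaluation_latency_us, copied from the ANOVA table.
FACTOR_MS = {
    "cache_ttl_sec": 2475.5095,
    "rule_complexity_score": 1140.6524,
    "sdk_polling_interval_sec": 7214.4024,
}
MS_ERROR = 2473.0000
DF_ERROR = 8

def f_ratio(ms_effect, ms_error):
    """F statistic: effect mean square over error mean square."""
    return ms_effect / ms_error

def p_value_two_numerator_df(f, df_error):
    """Upper-tail probability of F(2, df_error).
    Closed form valid only when the numerator has exactly 2 degrees of freedom."""
    return (1 + 2 * f / df_error) ** (-df_error / 2)

anova = {}
for name, ms in FACTOR_MS.items():
    f = f_ratio(ms, MS_ERROR)
    anova[name] = (f, p_value_two_numerator_df(f, DF_ERROR))
    print(f"{name:28s} F={f:.3f} p={anova[name][1]:.4f}")
```

The Lack of Fit row has 6 numerator degrees of freedom, so this closed form does not apply to it.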
Figures for evaluation_latency_us: Pareto Chart, Main Effects Plot, Normal Probability Plot of Effects, Half-Normal Plot of Effects, Model Diagnostics.
Response: cache_hit_rate_pct
Top factors: sdk_polling_interval_sec (58.1%), rule_complexity_score (30.1%), cache_ttl_sec (11.8%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| cache_ttl_sec | 2 | 20.6422 | 10.3211 | 0.275 | 0.7661 |
| rule_complexity_score | 2 | 164.3722 | 82.1861 | 2.193 | 0.1740 |
| sdk_polling_interval_sec | 2 | 635.2204 | 317.6102 | 8.476 | 0.0106 |
| Lack of Fit | 6 | 619.5691 | 103.2615 | 2.756 | 0.2900 |
| Pure Error | 2 | 74.9400 | 37.4700 | | |
| Error | 8 | 694.5091 | 37.4700 | | |
| Total | 14 | 1514.7440 | 108.1960 | | |
Figures for cache_hit_rate_pct: Pareto Chart, Main Effects Plot, Normal Probability Plot of Effects, Half-Normal Plot of Effects, Model Diagnostics.
Response Surface Plots
3D surfaces fitted with quadratic RSM. Red dots are observed data points.
Figure: cache_hit_rate_pct, cache_ttl_sec vs rule_complexity_score
Figure: cache_hit_rate_pct, cache_ttl_sec vs sdk_polling_interval_sec
Figure: cache_hit_rate_pct, rule_complexity_score vs sdk_polling_interval_sec
Figure: evaluation_latency_us, cache_ttl_sec vs rule_complexity_score
Figure: evaluation_latency_us, cache_ttl_sec vs sdk_polling_interval_sec
Figure: evaluation_latency_us, rule_complexity_score vs sdk_polling_interval_sec
Multi-Objective Optimization
When responses compete, Derringer–Suich desirability finds the best compromise.
Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
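The combining step can be sketched as follows. This covers only the weighted geometric mean, using the per-response desirabilities and weights reported in this section; the scaling of each raw response to a 0–1 desirability depends on bounds that the report does not state, so it is left out. The function name `overall_desirability` is illustrative.

```python
import math

def overall_desirability(desirabilities, weights):
    """Weighted geometric mean: D = (prod d_i ** w_i) ** (1 / sum w_i).
    Returns 0.0 if any individual desirability is 0, since one fully
    unacceptable response makes the whole compromise unacceptable."""
    if any(d <= 0 for d in desirabilities):
        return 0.0
    log_sum = sum(w * math.log(d) for d, w in zip(desirabilities, weights))
    return math.exp(log_sum / sum(weights))

# Values reported for evaluation_latency_us and cache_hit_rate_pct.
D = overall_desirability([0.9529, 0.9978], [1.0, 1.5])
print(f"D = {D:.4f}")
```

With the reported inputs this reproduces the overall desirability shown in this section to four decimal places. The higher weight on cache_hit_rate_pct pulls D toward that response's desirability.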
Overall Desirability
D = 0.9796
Per-Response Desirability
| Response | Weight | Desirability | Predicted | Dir |
| evaluation_latency_us | 1.0 | 0.9529 | 115.32 us | ↓ |
| cache_hit_rate_pct | 1.5 | 0.9978 | 92.14 % | ↑ |
Recommended Settings
| Factor | Value |
| cache_ttl_sec | 120 sec |
| rule_complexity_score | 2.8 score |
| sdk_polling_interval_sec | 300 sec |
Source: from RSM model prediction
Trade-off Summary
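Sacrifice = how much worse the predicted value is than the best observed value for that response. A negative sacrifice means the prediction is better than the best observed value.
| Response | Predicted | Best Observed | Sacrifice |
| evaluation_latency_us | 115.32 | 115.00 | +0.32 |
| cache_hit_rate_pct | 92.14 | 90.60 | -1.54 |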
Top 3 Runs by Desirability
| Run | D | Factor Settings |
| #8 | 0.9545 | cache_ttl_sec=120, rule_complexity_score=1, sdk_polling_interval_sec=155 |
| #4 | 0.8940 | cache_ttl_sec=120, rule_complexity_score=5.5, sdk_polling_interval_sec=300 |
| #1 | 0.8861 | cache_ttl_sec=62.5, rule_complexity_score=10, sdk_polling_interval_sec=300 |
Model Quality
| Response | R² | Type |
| evaluation_latency_us | 0.3533 | linear |
| cache_hit_rate_pct | 0.8056 | quadratic |
Full Multi-Objective Output
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================
Overall desirability: D = 0.9796
Response Weight Desirability Predicted Direction
---------------------------------------------------------------------
evaluation_latency_us 1.0 0.9529 115.32 us ↓
cache_hit_rate_pct 1.5 0.9978 92.14 % ↑
Recommended settings:
cache_ttl_sec = 120 sec
rule_complexity_score = 2.8 score
sdk_polling_interval_sec = 300 sec
(from RSM model prediction)
Trade-off summary:
evaluation_latency_us: 115.32 (best observed: 115.00, sacrifice: +0.32)
cache_hit_rate_pct: 92.14 (best observed: 90.60, sacrifice: -1.54)
Model quality:
evaluation_latency_us: R² = 0.3533 (linear)
cache_hit_rate_pct: R² = 0.8056 (quadratic)
Top 3 observed runs by overall desirability:
1. Run #8 (D=0.9545): cache_ttl_sec=120, rule_complexity_score=1, sdk_polling_interval_sec=155
2. Run #4 (D=0.8940): cache_ttl_sec=120, rule_complexity_score=5.5, sdk_polling_interval_sec=300
3. Run #1 (D=0.8861): cache_ttl_sec=62.5, rule_complexity_score=10, sdk_polling_interval_sec=300
Full Analysis Output
=== Main Effects: evaluation_latency_us ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
sdk_polling_interval_sec 75.0357 13.4490 48.8%
cache_ttl_sec 49.5000 13.4490 32.2%
rule_complexity_score 29.2143 13.4490 19.0%
=== ANOVA Table: evaluation_latency_us ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
cache_ttl_sec 2 4951.0190 2475.5095 1.001 0.4093
rule_complexity_score 2 2281.3048 1140.6524 0.461 0.6463
sdk_polling_interval_sec 2 14428.8048 7214.4024 2.917 0.1118
Lack of Fit 6 11376.6048 1896.1008 0.767 0.6614
Pure Error 2 4946.0000 2473.0000
Error 8 16322.6048 2473.0000
Total 14 37983.7333 2713.1238
=== Summary Statistics: evaluation_latency_us ===
cache_ttl_sec:
Level N Mean Std Min Max
------------------------------------------------------------
120 4 149.0000 21.1818 125.0000 174.0000
5 4 198.5000 81.4391 115.0000 295.0000
62.5 7 177.4286 44.3278 123.0000 232.0000
rule_complexity_score:
Level N Mean Std Min Max
------------------------------------------------------------
1 4 158.5000 28.8964 129.0000 198.0000
10 4 171.0000 53.2103 115.0000 232.0000
5.5 7 187.7143 64.1657 123.0000 295.0000
sdk_polling_interval_sec:
Level N Mean Std Min Max
------------------------------------------------------------
10 4 171.2500 53.3065 125.0000 234.0000
155 7 149.7143 33.5446 115.0000 218.0000
300 4 224.7500 52.5317 174.0000 295.0000
=== Main Effects: cache_hit_rate_pct ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
sdk_polling_interval_sec 15.5893 2.6857 58.1%
rule_complexity_score 8.0750 2.6857 30.1%
cache_ttl_sec 3.1750 2.6857 11.8%
=== ANOVA Table: cache_hit_rate_pct ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
cache_ttl_sec 2 20.6422 10.3211 0.275 0.7661
rule_complexity_score 2 164.3722 82.1861 2.193 0.1740
sdk_polling_interval_sec 2 635.2204 317.6102 8.476 0.0106
Lack of Fit 6 619.5691 103.2615 2.756 0.2900
Pure Error 2 74.9400 37.4700
Error 8 694.5091 37.4700
Total 14 1514.7440 108.1960
=== Summary Statistics: cache_hit_rate_pct ===
cache_ttl_sec:
Level N Mean Std Min Max
------------------------------------------------------------
120 4 78.0000 9.1706 65.9000 88.2000
5 4 74.8250 13.9170 58.1000 90.6000
62.5 7 76.7714 10.4941 59.1000 87.6000
rule_complexity_score:
Level N Mean Std Min Max
------------------------------------------------------------
1 4 82.0250 3.8003 79.1000 87.6000
10 4 73.9500 13.7294 59.1000 90.6000
5.5 7 74.9714 11.1172 58.1000 88.2000
sdk_polling_interval_sec:
Level N Mean Std Min Max
------------------------------------------------------------
10 4 78.3250 11.1222 67.3000 88.2000
155 7 81.6143 5.3757 75.5000 90.6000
300 4 66.0250 10.5677 58.1000 81.0000
Optimization Recommendations
=== Optimization: evaluation_latency_us ===
Direction: minimize
Best observed run: #8
cache_ttl_sec = 62.5
rule_complexity_score = 5.5
sdk_polling_interval_sec = 155
Value: 115.0
RSM Model (linear, R² = 0.1396, Adj R² = -0.0950):
Coefficients:
intercept +175.4667
cache_ttl_sec -5.3750
rule_complexity_score +5.2500
sdk_polling_interval_sec -24.6250
RSM Model (quadratic, R² = 0.3454, Adj R² = -0.8329):
Coefficients:
intercept +158.6667
cache_ttl_sec -5.3750
rule_complexity_score +5.2500
sdk_polling_interval_sec -24.6250
cache_ttl_sec*rule_complexity_score +2.7500
cache_ttl_sec*sdk_polling_interval_sec -15.5000
rule_complexity_score*sdk_polling_interval_sec -10.7500
cache_ttl_sec^2 -16.5833
rule_complexity_score^2 +34.6667
sdk_polling_interval_sec^2 +13.4167
Curvature analysis:
rule_complexity_score coef=+34.6667 convex (has a minimum)
cache_ttl_sec coef=-16.5833 concave (has a maximum)
sdk_polling_interval_sec coef=+13.4167 convex (has a minimum)
Notable interactions:
cache_ttl_sec*sdk_polling_interval_sec coef=-15.5000 (antagonistic)
rule_complexity_score*sdk_polling_interval_sec coef=-10.7500 (antagonistic)
cache_ttl_sec*rule_complexity_score coef=+2.7500 (synergistic)
Predicted optimum (from linear model, at observed points):
cache_ttl_sec = 5
rule_complexity_score = 5.5
sdk_polling_interval_sec = 10
Predicted value: 205.4667
Surface optimum (via L-BFGS-B, linear model):
cache_ttl_sec = 120
rule_complexity_score = 1
sdk_polling_interval_sec = 300
Predicted value: 140.2167
Model quality: Weak fit — consider adding center points or using a different design.
Factor importance:
1. sdk_polling_interval_sec (effect: 49.2, contribution: 42.9%)
2. rule_complexity_score (effect: 40.1, contribution: 35.0%)
3. cache_ttl_sec (effect: 25.4, contribution: 22.1%)
=== Optimization: cache_hit_rate_pct ===
Direction: maximize
Best observed run: #8
cache_ttl_sec = 62.5
rule_complexity_score = 5.5
sdk_polling_interval_sec = 155
Value: 90.6
RSM Model (linear, R² = 0.2346, Adj R² = 0.0258):
Coefficients:
intercept +76.5800
cache_ttl_sec +1.8625
rule_complexity_score +0.7875
sdk_polling_interval_sec +6.3500
RSM Model (quadratic, R² = 0.4849, Adj R² = -0.4422):
Coefficients:
intercept +79.1000
cache_ttl_sec +1.8625
rule_complexity_score +0.7875
sdk_polling_interval_sec +6.3500
cache_ttl_sec*rule_complexity_score -0.7500
cache_ttl_sec*sdk_polling_interval_sec +4.9750
rule_complexity_score*sdk_polling_interval_sec +1.1250
cache_ttl_sec^2 +4.8000
rule_complexity_score^2 -3.9500
sdk_polling_interval_sec^2 -5.5750
Curvature analysis:
sdk_polling_interval_sec coef=-5.5750 concave (has a maximum)
cache_ttl_sec coef=+4.8000 convex (has a minimum)
rule_complexity_score coef=-3.9500 concave (has a maximum)
Notable interactions:
cache_ttl_sec*sdk_polling_interval_sec coef=+4.9750 (synergistic)
rule_complexity_score*sdk_polling_interval_sec coef=+1.1250 (synergistic)
cache_ttl_sec*rule_complexity_score coef=-0.7500 (antagonistic)
Predicted optimum (from linear model, at observed points):
cache_ttl_sec = 120
rule_complexity_score = 5.5
sdk_polling_interval_sec = 300
Predicted value: 84.7925
Surface optimum (via L-BFGS-B, linear model):
cache_ttl_sec = 120
rule_complexity_score = 10
sdk_polling_interval_sec = 300
Predicted value: 85.5800
Model quality: Weak fit — consider adding center points or using a different design.
Factor importance:
1. sdk_polling_interval_sec (effect: 12.7, contribution: 51.4%)
2. cache_ttl_sec (effect: 7.3, contribution: 29.7%)
3. rule_complexity_score (effect: 4.7, contribution: 18.9%)
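The surface optima reported for the linear models can be reproduced by hand if the coefficients are read as acting on coded units, where each factor's low level maps to -1 and its high level to +1. That coding is an assumption inferred from the numbers, not something the output states, and the helper names `to_coded` and `predict_linear` are illustrative.

```python
# Factor ranges and linear-model coefficients copied from this report.
RANGES = {
    "cache_ttl_sec": (5, 120),
    "rule_complexity_score": (1, 10),
    "sdk_polling_interval_sec": (10, 300),
}
LATENCY_LINEAR = {
    "intercept": 175.4667,
    "cache_ttl_sec": -5.3750,
    "rule_complexity_score": 5.2500,
    "sdk_polling_interval_sec": -24.6250,
}
HIT_RATE_LINEAR = {
    "intercept": 76.5800,
    "cache_ttl_sec": 1.8625,
    "rule_complexity_score": 0.7875,
    "sdk_polling_interval_sec": 6.3500,
}

def to_coded(name, value):
    """Map a real factor value to coded units (assumed: low -> -1, high -> +1)."""
    low, high = RANGES[name]
    return (value - (low + high) / 2) / ((high - low) / 2)

def predict_linear(coefs, settings):
    """Intercept plus the sum of coefficient times coded factor value."""
    return coefs["intercept"] + sum(
        coefs[name] * to_coded(name, value) for name, value in settings.items()
    )

latency_opt = predict_linear(LATENCY_LINEAR, {
    "cache_ttl_sec": 120, "rule_complexity_score": 1, "sdk_polling_interval_sec": 300,
})
hit_rate_opt = predict_linear(HIT_RATE_LINEAR, {
    "cache_ttl_sec": 120, "rule_complexity_score": 10, "sdk_polling_interval_sec": 300,
})
print(f"latency: {latency_opt:.4f} us, hit rate: {hit_rate_opt:.4f} %")
```

Under this assumption the two predictions agree with the surface optimum values listed above. Because a linear model has no curvature, its optimum always lies at a corner of the factor ranges, which is worth keeping in mind given the low R² values reported for these fits.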