Box-Behnken Design

Feature Flag Evaluation

Box-Behnken design to tune cache TTL, rule complexity, and SDK polling interval for evaluation latency and cache hits

Summary

This experiment uses a Box-Behnken design to tune cache TTL, rule complexity, and SDK polling interval for feature flag evaluation, targeting lower evaluation latency and a higher cache hit rate.

The design varies 3 factors: cache_ttl_sec (5 to 120 sec), rule_complexity_score (1 to 10, score), and sdk_polling_interval_sec (10 to 300 sec). The goal is to optimize 2 responses: evaluation_latency_us (us, minimize) and cache_hit_rate_pct (%, maximize). Fixed conditions held constant across all runs are platform = launchdarkly and sdk = server_side.

A Box-Behnken design was chosen because it efficiently fits quadratic models with 3 continuous factors while avoiding extreme corner combinations, requiring only 15 runs instead of the 27 needed for a full factorial at three levels.

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The response surface plots below visualize how pairs of factors jointly affect each response.
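To make the size of the quadratic model concrete, the sketch below lists the terms of a full second-order model for the three factors. The term labels follow the naming seen in the optimizer output on this page; the helper `quadratic_terms` is hypothetical and is not part of the DOE Helper Tool.

```python
from itertools import combinations

# Factor names as they appear in the configuration.
factors = ["cache_ttl_sec", "rule_complexity_score", "sdk_polling_interval_sec"]


def quadratic_terms(names):
    """Return the term labels of a full second-order (quadratic) model."""
    terms = ["intercept"]
    terms += list(names)                                       # linear terms
    terms += [f"{a}*{b}" for a, b in combinations(names, 2)]   # two-factor interactions
    terms += [f"{n}^2" for n in names]                         # pure quadratic terms
    return terms


terms = quadratic_terms(factors)
n_runs = 15                            # runs in this Box-Behnken design
residual_df = n_runs - len(terms)      # degrees of freedom left after fitting

for term in terms:
    print(term)
print(f"{len(terms)} coefficients, {residual_df} residual degrees of freedom")
```

With 10 coefficients and 15 runs, only 5 degrees of freedom remain for estimating error, which is one reason a quadratic fit on this design can show a low or negative adjusted R².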

Key Findings

For evaluation_latency_us, the most influential factors were sdk_polling_interval_sec (48.8%), cache_ttl_sec (32.2%), and rule_complexity_score (19.0%). The best observed value was 115.0 (at cache_ttl_sec = 62.5, rule_complexity_score = 5.5, sdk_polling_interval_sec = 155).

For cache_hit_rate_pct, the most influential factors were sdk_polling_interval_sec (58.1%), rule_complexity_score (30.1%), and cache_ttl_sec (11.8%). The best observed value was 90.6 (at cache_ttl_sec = 62.5, rule_complexity_score = 5.5, sdk_polling_interval_sec = 155).

Recommended Next Steps

Experimental Setup

Factors

| Factor | Low | High | Unit |
|---|---|---|---|
| cache_ttl_sec | 5 | 120 | sec |
| rule_complexity_score | 1 | 10 | score |
| sdk_polling_interval_sec | 10 | 300 | sec |

Fixed: platform = launchdarkly, sdk = server_side

Responses

| Response | Direction | Unit |
|---|---|---|
| evaluation_latency_us | ↓ minimize | us |
| cache_hit_rate_pct | ↑ maximize | % |

Configuration

use_cases/84_feature_flag_evaluation/config.json
{
  "metadata": {
    "name": "Feature Flag Evaluation",
    "description": "Box-Behnken design to tune cache TTL, rule complexity, and SDK polling interval for evaluation latency and cache hits"
  },
  "factors": [
    { "name": "cache_ttl_sec", "levels": ["5", "120"], "type": "continuous", "unit": "sec" },
    { "name": "rule_complexity_score", "levels": ["1", "10"], "type": "continuous", "unit": "score" },
    { "name": "sdk_polling_interval_sec", "levels": ["10", "300"], "type": "continuous", "unit": "sec" }
  ],
  "fixed_factors": {
    "platform": "launchdarkly",
    "sdk": "server_side"
  },
  "responses": [
    { "name": "evaluation_latency_us", "optimize": "minimize", "unit": "us" },
    { "name": "cache_hit_rate_pct", "optimize": "maximize", "unit": "%" }
  ],
  "settings": {
    "operation": "box_behnken",
    "test_script": "use_cases/84_feature_flag_evaluation/sim.sh"
  }
}

Experimental Matrix

The Box-Behnken Design produces 15 runs. Each row is one experiment with specific factor settings.

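| Run | cache_ttl_sec | rule_complexity_score | sdk_polling_interval_sec |
|---|---|---|---|
| 1 | 62.5 | 1 | 10 |
| 2 | 62.5 | 5.5 | 155 |
| 3 | 120 | 5.5 | 300 |
| 4 | 120 | 5.5 | 10 |
| 5 | 62.5 | 5.5 | 155 |
| 6 | 62.5 | 5.5 | 155 |
| 7 | 5 | 5.5 | 300 |
| 8 | 120 | 1 | 155 |
| 9 | 62.5 | 1 | 300 |
| 10 | 120 | 10 | 155 |
| 11 | 5 | 5.5 | 10 |
| 12 | 62.5 | 10 | 300 |
| 13 | 5 | 1 | 155 |
| 14 | 5 | 10 | 155 |
| 15 | 62.5 | 10 | 10 |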
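The structure of the matrix above can be reproduced with a short sketch: for each pair of factors, take the four low/high combinations while the remaining factor sits at its midpoint, then add three center points. This is an illustrative construction for the three-factor case only; the function `box_behnken` is hypothetical, and the tool's own generator and run ordering may differ.

```python
from itertools import combinations, product

# Low/high bounds taken from the Factors table.
levels = {
    "cache_ttl_sec": (5, 120),
    "rule_complexity_score": (1, 10),
    "sdk_polling_interval_sec": (10, 300),
}


def box_behnken(levels, n_center=3):
    """Build a three-factor Box-Behnken design as a list of run dicts."""
    names = list(levels)
    center = {n: (lo + hi) / 2 for n, (lo, hi) in levels.items()}
    runs = []
    # Each factor pair gets a 2x2 factorial; the other factor stays at its midpoint.
    for a, b in combinations(names, 2):
        for va, vb in product(levels[a], levels[b]):
            run = dict(center)
            run[a], run[b] = va, vb
            runs.append(run)
    # Replicated center points allow a pure-error estimate.
    runs += [dict(center) for _ in range(n_center)]
    return runs


design = box_behnken(levels)
full_factorial_runs = 3 ** len(levels)
print(len(design), "Box-Behnken runs vs", full_factorial_runs, "for a three-level full factorial")
```

No run places all three factors at an extreme simultaneously, which is the property that keeps the design away from corner combinations.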

Step-by-Step Workflow

1. Preview the design

$ doe info --config use_cases/84_feature_flag_evaluation/config.json
2. Generate the runner script

$ doe generate --config use_cases/84_feature_flag_evaluation/config.json \
    --output use_cases/84_feature_flag_evaluation/results/run.sh --seed 42
3. Execute the experiments

$ bash use_cases/84_feature_flag_evaluation/results/run.sh
4. Analyze results

$ doe analyze --config use_cases/84_feature_flag_evaluation/config.json
5. Get optimization recommendations

$ doe optimize --config use_cases/84_feature_flag_evaluation/config.json
6. Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

$ doe optimize --config use_cases/84_feature_flag_evaluation/config.json --multi
7. Generate the HTML report

$ doe report --config use_cases/84_feature_flag_evaluation/config.json \
    --output use_cases/84_feature_flag_evaluation/results/report.html

Features Exercised

| Feature | Value |
|---|---|
| Design type | box_behnken |
| Factor types | continuous (all 3) |
| Arg style | double-dash |
| Responses | 2 (evaluation_latency_us ↓, cache_hit_rate_pct ↑) |
| Total runs | 15 |

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: evaluation_latency_us

Top factors: sdk_polling_interval_sec (48.8%), cache_ttl_sec (32.2%), rule_complexity_score (19.0%).
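The percentages above can be reproduced from the level means listed under Summary Statistics in the full analysis output. The sketch assumes that an effect is the range of the level means (largest minus smallest) and that a contribution is that effect's share of the summed effects; this matches the reported figures but is inferred from the output, not taken from the tool's source.

```python
# Mean evaluation_latency_us at each factor level, from the Summary Statistics output.
level_means = {
    "cache_ttl_sec": {5: 198.5, 62.5: 177.4286, 120: 149.0},
    "rule_complexity_score": {1: 158.5, 5.5: 187.7143, 10: 171.0},
    "sdk_polling_interval_sec": {10: 171.25, 155: 149.7143, 300: 224.75},
}

# Assumed definition: effect = spread between the highest and lowest level mean.
effects = {f: max(m.values()) - min(m.values()) for f, m in level_means.items()}

# Assumed definition: contribution = effect as a percentage of all effects combined.
total = sum(effects.values())
contribution = {f: 100 * e / total for f, e in effects.items()}

for f in sorted(effects, key=effects.get, reverse=True):
    print(f"{f:26s} effect={effects[f]:8.4f}  contribution={contribution[f]:.1f}%")
```

Because the contributions are shares of a total, they always sum to 100% and say nothing about statistical significance; the ANOVA p-values cover that.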

ANOVA

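| Source | DF | SS | MS | F | p-value |
|---|---|---|---|---|---|
| cache_ttl_sec | 2 | 4951.0190 | 2475.5095 | 1.001 | 0.4093 |
| rule_complexity_score | 2 | 2281.3048 | 1140.6524 | 0.461 | 0.6463 |
| sdk_polling_interval_sec | 2 | 14428.8048 | 7214.4024 | 2.917 | 0.1118 |
| Lack of Fit | 6 | 11376.6048 | 1896.1008 | | |
| Pure Error | 2 | 4946.0000 | | | |
| Error | 8 | 16322.6048 | 2473.0000 | | |
| Total | 14 | 37983.7333 | 2713.1238 | | |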

Pareto Chart

Pareto chart for evaluation_latency_us

Main Effects Plot

Main effects plot for evaluation_latency_us

Normal Probability Plot of Effects

Normal probability plot for evaluation_latency_us

Half-Normal Plot of Effects

Half-normal plot for evaluation_latency_us

Model Diagnostics

Model diagnostics for evaluation_latency_us

Response: cache_hit_rate_pct

Top factors: sdk_polling_interval_sec (58.1%), rule_complexity_score (30.1%), cache_ttl_sec (11.8%).

ANOVA

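| Source | DF | SS | MS | F | p-value |
|---|---|---|---|---|---|
| cache_ttl_sec | 2 | 20.6422 | 10.3211 | 0.275 | 0.7661 |
| rule_complexity_score | 2 | 164.3722 | 82.1861 | 2.193 | 0.1740 |
| sdk_polling_interval_sec | 2 | 635.2204 | 317.6102 | 8.476 | 0.0106 |
| Lack of Fit | 6 | 619.5691 | 103.2615 | | |
| Pure Error | 2 | 74.9400 | | | |
| Error | 8 | 694.5091 | 37.4700 | | |
| Total | 14 | 1514.7440 | 108.1960 | | |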
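The F column in both ANOVA tables is each source's mean square divided by the error mean square. The sketch below recomputes those ratios from the reported MS values, including the Lack of Fit ratios that appear in the full analysis output (0.767 and 2.756). It assumes the denominator is the MS shown in the Error row; p-values are not recomputed, since that needs an F-distribution routine.

```python
# Mean squares copied from the two ANOVA tables; "ms_error" is the MS in the Error row.
anova = {
    "evaluation_latency_us": {
        "ms_error": 2473.0,
        "ms": {
            "cache_ttl_sec": 2475.5095,
            "rule_complexity_score": 1140.6524,
            "sdk_polling_interval_sec": 7214.4024,
            "Lack of Fit": 1896.1008,
        },
    },
    "cache_hit_rate_pct": {
        "ms_error": 37.47,
        "ms": {
            "cache_ttl_sec": 10.3211,
            "rule_complexity_score": 82.1861,
            "sdk_polling_interval_sec": 317.6102,
            "Lack of Fit": 103.2615,
        },
    },
}


def f_ratios(table):
    """Return F = MS_source / MS_error for every source in one ANOVA table."""
    return {src: ms / table["ms_error"] for src, ms in table["ms"].items()}


for response, table in anova.items():
    print(response)
    for src, f in f_ratios(table).items():
        print(f"  {src:26s} F = {f:.3f}")
```

Only sdk_polling_interval_sec for cache_hit_rate_pct has a large ratio (about 8.5), which is consistent with it being the one term with p below 0.05.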

Pareto Chart

Pareto chart for cache_hit_rate_pct

Main Effects Plot

Main effects plot for cache_hit_rate_pct

Normal Probability Plot of Effects

Normal probability plot for cache_hit_rate_pct

Half-Normal Plot of Effects

Half-normal plot for cache_hit_rate_pct

Model Diagnostics

Model diagnostics for cache_hit_rate_pct

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

RSM surface: cache_hit_rate_pct, cache_ttl_sec vs rule_complexity_score

RSM surface: cache_hit_rate_pct, cache_ttl_sec vs sdk_polling_interval_sec

RSM surface: cache_hit_rate_pct, rule_complexity_score vs sdk_polling_interval_sec

RSM surface: evaluation_latency_us, cache_ttl_sec vs rule_complexity_score

RSM surface: evaluation_latency_us, cache_ttl_sec vs sdk_polling_interval_sec

RSM surface: evaluation_latency_us, rule_complexity_score vs sdk_polling_interval_sec

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.9796

Per-Response Desirability

| Response | Weight | Desirability | Predicted | Dir |
|---|---|---|---|---|
| evaluation_latency_us | 1.0 | 0.9529 | 115.32 us | ↓ |
| cache_hit_rate_pct | 1.5 | 0.9978 | 92.14 % | ↑ |
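The overall desirability can be checked by hand from the per-response values above. The sketch below computes a weighted geometric mean, D = (d1^w1 * d2^w2)^(1/(w1 + w2)), using the reported weights and desirabilities. The function `overall_desirability` is a hypothetical helper written for this check, not the tool's implementation.

```python
import math

# Per-response desirabilities and weights as reported in this section.
desirability = {"evaluation_latency_us": 0.9529, "cache_hit_rate_pct": 0.9978}
weight = {"evaluation_latency_us": 1.0, "cache_hit_rate_pct": 1.5}


def overall_desirability(d, w):
    """Weighted geometric mean of individual desirabilities."""
    if any(value <= 0 for value in d.values()):
        return 0.0  # one unacceptable response makes the whole compromise unacceptable
    log_sum = sum(w[name] * math.log(d[name]) for name in d)
    return math.exp(log_sum / sum(w[name] for name in d))


D = overall_desirability(desirability, weight)
print(f"D = {D:.4f}")
```

The geometric mean is what makes the method a compromise: a response with desirability near zero drags D toward zero regardless of how well the other response does.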

Recommended Settings

| Factor | Value |
|---|---|
| cache_ttl_sec | 120 sec |
| rule_complexity_score | 2.8 score |
| sdk_polling_interval_sec | 300 sec |

Source: from RSM model prediction

Trade-off Summary

Sacrifice = how much worse than single-objective best.

| Response | Predicted | Best Observed | Sacrifice |
|---|---|---|---|
| evaluation_latency_us | 115.32 | 115.00 | +0.32 |
| cache_hit_rate_pct | 92.14 | 90.60 | -1.54 |
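The sign convention for sacrifice can be inferred from the numbers in the full multi-objective output: a positive value means the compromise is worse than the best observed run, a negative value means the model predicts something better than any observed run. The sketch below encodes that reading; the formula is an assumption that reproduces the reported +0.32 and -1.54.

```python
def sacrifice(predicted, best_observed, direction):
    """Positive when the compromise is worse than the best observed value."""
    if direction == "minimize":
        return predicted - best_observed   # a higher prediction is worse
    return best_observed - predicted       # a lower prediction is worse


# Predicted and best-observed values from the multi-objective output.
latency_sacrifice = sacrifice(115.32, 115.00, "minimize")
hit_rate_sacrifice = sacrifice(92.14, 90.60, "maximize")

print(f"evaluation_latency_us: {latency_sacrifice:+.2f}")
print(f"cache_hit_rate_pct:    {hit_rate_sacrifice:+.2f}")
```

A negative sacrifice is an extrapolation from the fitted model, so it is worth confirming with a run at the recommended settings.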

Top 3 Runs by Desirability

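| Run | D | Factor Settings |
|---|---|---|
| #8 | 0.9545 | cache_ttl_sec=120, rule_complexity_score=1, sdk_polling_interval_sec=155 |
| #4 | 0.8940 | cache_ttl_sec=120, rule_complexity_score=5.5, sdk_polling_interval_sec=300 |
| #1 | 0.8861 | cache_ttl_sec=62.5, rule_complexity_score=10, sdk_polling_interval_sec=300 |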

Model Quality

| Response | R² | Type |
|---|---|---|
| evaluation_latency_us | 0.3533 | linear |
| cache_hit_rate_pct | 0.8056 | quadratic |

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9796

Response                  Weight  Desirability  Predicted   Direction
---------------------------------------------------------------------
evaluation_latency_us     1.0     0.9529        115.32 us   ↓
cache_hit_rate_pct        1.5     0.9978        92.14 %     ↑

Recommended settings:
  cache_ttl_sec = 120 sec
  rule_complexity_score = 2.8 score
  sdk_polling_interval_sec = 300 sec
  (from RSM model prediction)

Trade-off summary:
  evaluation_latency_us: 115.32 (best observed: 115.00, sacrifice: +0.32)
  cache_hit_rate_pct: 92.14 (best observed: 90.60, sacrifice: -1.54)

Model quality:
  evaluation_latency_us: R² = 0.3533 (linear)
  cache_hit_rate_pct: R² = 0.8056 (quadratic)

Top 3 observed runs by overall desirability:
  1. Run #8 (D=0.9545): cache_ttl_sec=120, rule_complexity_score=1, sdk_polling_interval_sec=155
  2. Run #4 (D=0.8940): cache_ttl_sec=120, rule_complexity_score=5.5, sdk_polling_interval_sec=300
  3. Run #1 (D=0.8861): cache_ttl_sec=62.5, rule_complexity_score=10, sdk_polling_interval_sec=300

Full Analysis Output

doe analyze
=== Main Effects: evaluation_latency_us ===
Factor                     Effect     Std Error   % Contribution
--------------------------------------------------------------
sdk_polling_interval_sec   75.0357    13.4490     48.8%
cache_ttl_sec              49.5000    13.4490     32.2%
rule_complexity_score      29.2143    13.4490     19.0%

=== ANOVA Table: evaluation_latency_us ===
Source                     DF   SS           MS          F       p-value
-----------------------------------------------------------------------------
cache_ttl_sec              2    4951.0190    2475.5095   1.001   0.4093
rule_complexity_score      2    2281.3048    1140.6524   0.461   0.6463
sdk_polling_interval_sec   2    14428.8048   7214.4024   2.917   0.1118
Lack of Fit                6    11376.6048   1896.1008   0.767   0.6614
Pure Error                 2    4946.0000    2473.0000
Error                      8    16322.6048   2473.0000
Total                      14   37983.7333   2713.1238

=== Summary Statistics: evaluation_latency_us ===
cache_ttl_sec:
Level   N   Mean       Std       Min        Max
------------------------------------------------------------
120     4   149.0000   21.1818   125.0000   174.0000
5       4   198.5000   81.4391   115.0000   295.0000
62.5    7   177.4286   44.3278   123.0000   232.0000

rule_complexity_score:
Level   N   Mean       Std       Min        Max
------------------------------------------------------------
1       4   158.5000   28.8964   129.0000   198.0000
10      4   171.0000   53.2103   115.0000   232.0000
5.5     7   187.7143   64.1657   123.0000   295.0000

sdk_polling_interval_sec:
Level   N   Mean       Std       Min        Max
------------------------------------------------------------
10      4   171.2500   53.3065   125.0000   234.0000
155     7   149.7143   33.5446   115.0000   218.0000
300     4   224.7500   52.5317   174.0000   295.0000

=== Main Effects: cache_hit_rate_pct ===
Factor                     Effect     Std Error   % Contribution
--------------------------------------------------------------
sdk_polling_interval_sec   15.5893    2.6857      58.1%
rule_complexity_score      8.0750     2.6857      30.1%
cache_ttl_sec              3.1750     2.6857      11.8%

=== ANOVA Table: cache_hit_rate_pct ===
Source                     DF   SS          MS         F       p-value
-----------------------------------------------------------------------------
cache_ttl_sec              2    20.6422     10.3211    0.275   0.7661
rule_complexity_score      2    164.3722    82.1861    2.193   0.1740
sdk_polling_interval_sec   2    635.2204    317.6102   8.476   0.0106
Lack of Fit                6    619.5691    103.2615   2.756   0.2900
Pure Error                 2    74.9400     37.4700
Error                      8    694.5091    37.4700
Total                      14   1514.7440   108.1960

=== Summary Statistics: cache_hit_rate_pct ===
cache_ttl_sec:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
120     4   78.0000   9.1706    65.9000   88.2000
5       4   74.8250   13.9170   58.1000   90.6000
62.5    7   76.7714   10.4941   59.1000   87.6000

rule_complexity_score:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
1       4   82.0250   3.8003    79.1000   87.6000
10      4   73.9500   13.7294   59.1000   90.6000
5.5     7   74.9714   11.1172   58.1000   88.2000

sdk_polling_interval_sec:
Level   N   Mean      Std       Min       Max
------------------------------------------------------------
10      4   78.3250   11.1222   67.3000   88.2000
155     7   81.6143   5.3757    75.5000   90.6000
300     4   66.0250   10.5677   58.1000   81.0000

Optimization Recommendations

doe optimize
=== Optimization: evaluation_latency_us ===
Direction: minimize
Best observed run: #8
  cache_ttl_sec = 62.5
  rule_complexity_score = 5.5
  sdk_polling_interval_sec = 155
  Value: 115.0

RSM Model (linear, R² = 0.1396, Adj R² = -0.0950):
Coefficients:
  intercept                                        +175.4667
  cache_ttl_sec                                    -5.3750
  rule_complexity_score                            +5.2500
  sdk_polling_interval_sec                         -24.6250

RSM Model (quadratic, R² = 0.3454, Adj R² = -0.8329):
Coefficients:
  intercept                                        +158.6667
  cache_ttl_sec                                    -5.3750
  rule_complexity_score                            +5.2500
  sdk_polling_interval_sec                         -24.6250
  cache_ttl_sec*rule_complexity_score              +2.7500
  cache_ttl_sec*sdk_polling_interval_sec           -15.5000
  rule_complexity_score*sdk_polling_interval_sec   -10.7500
  cache_ttl_sec^2                                  -16.5833
  rule_complexity_score^2                          +34.6667
  sdk_polling_interval_sec^2                       +13.4167

Curvature analysis:
  rule_complexity_score      coef=+34.6667   convex (has a minimum)
  cache_ttl_sec              coef=-16.5833   concave (has a maximum)
  sdk_polling_interval_sec   coef=+13.4167   convex (has a minimum)

Notable interactions:
  cache_ttl_sec*sdk_polling_interval_sec           coef=-15.5000 (antagonistic)
  rule_complexity_score*sdk_polling_interval_sec   coef=-10.7500 (antagonistic)
  cache_ttl_sec*rule_complexity_score              coef=+2.7500 (synergistic)

Predicted optimum (from linear model, at observed points):
  cache_ttl_sec = 5
  rule_complexity_score = 5.5
  sdk_polling_interval_sec = 10
  Predicted value: 205.4667

Surface optimum (via L-BFGS-B, linear model):
  cache_ttl_sec = 120
  rule_complexity_score = 1
  sdk_polling_interval_sec = 300
  Predicted value: 140.2167

Model quality: Weak fit — consider adding center points or using a different design.

Factor importance:
  1. sdk_polling_interval_sec (effect: 49.2, contribution: 42.9%)
  2. rule_complexity_score (effect: 40.1, contribution: 35.0%)
  3. cache_ttl_sec (effect: 25.4, contribution: 22.1%)

=== Optimization: cache_hit_rate_pct ===
Direction: maximize
Best observed run: #8
  cache_ttl_sec = 62.5
  rule_complexity_score = 5.5
  sdk_polling_interval_sec = 155
  Value: 90.6

RSM Model (linear, R² = 0.2346, Adj R² = 0.0258):
Coefficients:
  intercept                                        +76.5800
  cache_ttl_sec                                    +1.8625
  rule_complexity_score                            +0.7875
  sdk_polling_interval_sec                         +6.3500

RSM Model (quadratic, R² = 0.4849, Adj R² = -0.4422):
Coefficients:
  intercept                                        +79.1000
  cache_ttl_sec                                    +1.8625
  rule_complexity_score                            +0.7875
  sdk_polling_interval_sec                         +6.3500
  cache_ttl_sec*rule_complexity_score              -0.7500
  cache_ttl_sec*sdk_polling_interval_sec           +4.9750
  rule_complexity_score*sdk_polling_interval_sec   +1.1250
  cache_ttl_sec^2                                  +4.8000
  rule_complexity_score^2                          -3.9500
  sdk_polling_interval_sec^2                       -5.5750

Curvature analysis:
  sdk_polling_interval_sec   coef=-5.5750   concave (has a maximum)
  cache_ttl_sec              coef=+4.8000   convex (has a minimum)
  rule_complexity_score      coef=-3.9500   concave (has a maximum)

Notable interactions:
  cache_ttl_sec*sdk_polling_interval_sec           coef=+4.9750 (synergistic)
  rule_complexity_score*sdk_polling_interval_sec   coef=+1.1250 (synergistic)
  cache_ttl_sec*rule_complexity_score              coef=-0.7500 (antagonistic)

Predicted optimum (from linear model, at observed points):
  cache_ttl_sec = 120
  rule_complexity_score = 5.5
  sdk_polling_interval_sec = 300
  Predicted value: 84.7925

Surface optimum (via L-BFGS-B, linear model):
  cache_ttl_sec = 120
  rule_complexity_score = 10
  sdk_polling_interval_sec = 300
  Predicted value: 85.5800

Model quality: Weak fit — consider adding center points or using a different design.

Factor importance:
  1. sdk_polling_interval_sec (effect: 12.7, contribution: 51.4%)
  2. cache_ttl_sec (effect: 7.3, contribution: 29.7%)
  3. rule_complexity_score (effect: 4.7, contribution: 18.9%)
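The predicted values in the optimizer output can be reproduced from the linear coefficients if the factors are first rescaled to coded units, where low maps to -1, the midpoint to 0, and high to +1. The sketch below assumes that coding, which is an inference from the numbers rather than something the output states; `to_coded` and `predict` are hypothetical helpers.

```python
# Factor ranges from the experimental setup.
ranges = {
    "cache_ttl_sec": (5, 120),
    "rule_complexity_score": (1, 10),
    "sdk_polling_interval_sec": (10, 300),
}

# Linear-model coefficients copied from the optimizer output.
latency_linear = {
    "intercept": 175.4667,
    "cache_ttl_sec": -5.3750,
    "rule_complexity_score": 5.2500,
    "sdk_polling_interval_sec": -24.6250,
}
hit_rate_linear = {
    "intercept": 76.5800,
    "cache_ttl_sec": 1.8625,
    "rule_complexity_score": 0.7875,
    "sdk_polling_interval_sec": 6.3500,
}


def to_coded(name, value):
    """Map an actual factor value onto the coded -1..+1 scale (assumed coding)."""
    lo, hi = ranges[name]
    return (value - (lo + hi) / 2) / ((hi - lo) / 2)


def predict(model, settings):
    """Evaluate a linear model at actual factor settings."""
    return model["intercept"] + sum(model[n] * to_coded(n, v) for n, v in settings.items())


corner = {"cache_ttl_sec": 120, "rule_complexity_score": 1, "sdk_polling_interval_sec": 300}
print(f"latency at surface optimum:  {predict(latency_linear, corner):.4f}")

best_hit = {"cache_ttl_sec": 120, "rule_complexity_score": 10, "sdk_polling_interval_sec": 300}
print(f"hit rate at surface optimum: {predict(hit_rate_linear, best_hit):.4f}")
```

Since a linear model has no interior stationary point, its optimum always lands on a corner of the factor ranges, which is why both surface optima sit at extreme settings.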