Feature Flag Evaluation

Summary

This experiment uses a Box-Behnken design to tune cache TTL, rule complexity, and SDK polling interval for feature flag evaluation, targeting evaluation latency and cache hit rate.

The design varies 3 factors: `cache_ttl_sec` (5 to 120 sec), `rule_complexity_score` (1 to 10), and `sdk_polling_interval_sec` (10 to 300 sec). The goal is to optimize 2 responses: `evaluation_latency_us` (us, minimize) and `cache_hit_rate_pct` (%, maximize). Two conditions are held constant across all runs: platform = launchdarkly and sdk = server_side.

A Box-Behnken design was chosen because it efficiently fits quadratic models with 3 continuous factors while avoiding extreme corner combinations. It requires only 15 runs instead of the 27 needed for a full factorial at three levels.

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
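As a rough illustration of what fitting a quadratic response surface involves, the sketch below builds the ten model columns (intercept, three linear terms, three two-factor interactions, three squared terms) in coded units and solves for the coefficients by ordinary least squares. It assumes NumPy is available. The response values are synthetic and noise-free, made up purely for the example, and the function names are hypothetical and not part of the DOE tool.

```python
import itertools
import numpy as np

def quadratic_design_matrix(X):
    """Columns: intercept, linear terms, two-factor interactions, squared terms."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(k)]
    cols += [X[:, i] * X[:, j] for i, j in itertools.combinations(range(k), 2)]
    cols += [X[:, i] ** 2 for i in range(k)]
    return np.column_stack(cols)

def fit_quadratic(X, y):
    """Ordinary least squares fit; returns (coefficients, R^2)."""
    A = quadratic_design_matrix(X)
    y = np.asarray(y, dtype=float)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Coded Box-Behnken points for 3 factors: 12 edge midpoints + 3 center runs.
edge = []
for i, j in itertools.combinations(range(3), 2):
    for a, b in itertools.product((-1, 1), repeat=2):
        point = [0, 0, 0]
        point[i], point[j] = a, b
        edge.append(point)
X = np.array(edge + [[0, 0, 0]] * 3, dtype=float)

# Synthetic coefficients (illustration only, not measured data).
true_beta = np.array([100.0, -5.0, 3.0, -20.0, 2.0, -10.0, 4.0, -15.0, 30.0, 12.0])
y = quadratic_design_matrix(X) @ true_beta

beta, r2 = fit_quadratic(X, y)
print(np.round(beta, 4), round(r2, 4))
```

With 15 runs and 10 coefficients only 5 residual degrees of freedom remain, which is one reason the adjusted R² of a quadratic fit can fall well below the plain R² when the data are noisy.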

Key Findings

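For `evaluation_latency_us`, the most influential factors were `sdk_polling_interval_sec` (48.8%), `cache_ttl_sec` (32.2%), and `rule_complexity_score` (19.0%). None of these effects is statistically significant at the 0.05 level in the ANOVA below (smallest p-value 0.1118). The best observed value was 115.0 us (at `cache_ttl_sec` = 62.5, `rule_complexity_score` = 5.5, `sdk_polling_interval_sec` = 155).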

For `cache_hit_rate_pct`, the most influential factors were `sdk_polling_interval_sec` (58.1%), `rule_complexity_score` (30.1%), and `cache_ttl_sec` (11.8%). Only `sdk_polling_interval_sec` is statistically significant at the 0.05 level in the ANOVA below (p = 0.0106). The best observed value was 90.6% (at `cache_ttl_sec` = 62.5, `rule_complexity_score` = 5.5, `sdk_polling_interval_sec` = 155).

Recommended Next Steps

Run confirmation experiments at the predicted optimal settings to validate the model.
Consider whether any fixed factors should be varied in a future study.

Experimental Setup

Factors

Factor	Low	High	Unit
`cache_ttl_sec`	5	120	sec
`rule_complexity_score`	1	10	score
`sdk_polling_interval_sec`	10	300	sec

Fixed: platform = launchdarkly, sdk = server_side

Responses

Response	Direction	Unit
`evaluation_latency_us`	↓ minimize	us
`cache_hit_rate_pct`	↑ maximize	%

Configuration

use_cases/84_feature_flag_evaluation/config.json

{
  "metadata": {
    "name": "Feature Flag Evaluation",
    "description": "Box-Behnken design to tune cache TTL, rule complexity, and SDK polling interval for evaluation latency and cache hits"
  },
  "factors": [
    {
      "name": "cache_ttl_sec",
      "levels": [
        "5",
        "120"
      ],
      "type": "continuous",
      "unit": "sec"
    },
    {
      "name": "rule_complexity_score",
      "levels": [
        "1",
        "10"
      ],
      "type": "continuous",
      "unit": "score"
    },
    {
      "name": "sdk_polling_interval_sec",
      "levels": [
        "10",
        "300"
      ],
      "type": "continuous",
      "unit": "sec"
    }
  ],
  "fixed_factors": {
    "platform": "launchdarkly",
    "sdk": "server_side"
  },
  "responses": [
    {
      "name": "evaluation_latency_us",
      "optimize": "minimize",
      "unit": "us"
    },
    {
      "name": "cache_hit_rate_pct",
      "optimize": "maximize",
      "unit": "%"
    }
  ],
  "settings": {
    "operation": "box_behnken",
    "test_script": "use_cases/84_feature_flag_evaluation/sim.sh"
  }
}
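The config stores factor levels as strings, and the midpoints that appear in the experimental matrix (62.5, 5.5, 155) are simply the averages of each low and high level. The sketch below parses a trimmed copy of the `factors` section and derives low, center, and high values. It relies only on the fields shown above; `factor_ranges` is a hypothetical helper, not a function of the DOE tool.

```python
import json

# Trimmed copy of the "factors" section from config.json.
CONFIG_TEXT = """
{
  "factors": [
    {"name": "cache_ttl_sec", "levels": ["5", "120"], "type": "continuous", "unit": "sec"},
    {"name": "rule_complexity_score", "levels": ["1", "10"], "type": "continuous", "unit": "score"},
    {"name": "sdk_polling_interval_sec", "levels": ["10", "300"], "type": "continuous", "unit": "sec"}
  ]
}
"""

def factor_ranges(config):
    """Return {name: (low, center, high)} with the string levels parsed as floats."""
    ranges = {}
    for factor in config["factors"]:
        low, high = (float(v) for v in factor["levels"])
        ranges[factor["name"]] = (low, (low + high) / 2, high)
    return ranges

config = json.loads(CONFIG_TEXT)
ranges = factor_ranges(config)
for name, (low, center, high) in ranges.items():
    print(f"{name}: low={low:g} center={center:g} high={high:g}")
```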

Experimental Matrix

The Box-Behnken design produces 15 runs: 12 edge-midpoint runs plus 3 center-point replicates (runs 2, 5, and 6). Each row is one experiment with specific factor settings.

Run	`cache_ttl_sec`	`rule_complexity_score`	`sdk_polling_interval_sec`
1	62.5	1	10
2	62.5	5.5	155
3	120	5.5	300
4	120	5.5	10
5	62.5	5.5	155
6	62.5	5.5	155
7	5	5.5	300
8	120	1	155
9	62.5	1	300
10	120	10	155
11	5	5.5	10
12	62.5	10	300
13	5	1	155
14	5	10	155
15	62.5	10	10
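The matrix above can be reproduced, up to run order, with a few lines of Python. For every pair of factors the sketch runs a 2x2 factorial on that pair while holding the remaining factor at its midpoint, then appends center runs. Using 3 center points is an assumption chosen to match the 15 rows shown here, and `box_behnken` is a hypothetical helper rather than the DOE tool's own generator.

```python
import itertools

def box_behnken(ranges, n_center=3):
    """Build a Box-Behnken design in actual units.

    ranges: list of (low, high) per factor.
    Returns a list of tuples, one per run (unrandomized order).
    """
    k = len(ranges)
    coded = []
    for i, j in itertools.combinations(range(k), 2):
        for a, b in itertools.product((-1, 1), repeat=2):
            point = [0] * k
            point[i], point[j] = a, b  # all other factors stay at the midpoint
            coded.append(point)
    coded += [[0] * k for _ in range(n_center)]

    def decode(c, low, high):
        # Map coded -1/0/+1 to low/midpoint/high.
        return (low + high) / 2 + c * (high - low) / 2

    return [
        tuple(decode(c, low, high) for c, (low, high) in zip(point, ranges))
        for point in coded
    ]

design = box_behnken([(5, 120), (1, 10), (10, 300)])
for run, settings in enumerate(design, start=1):
    print(run, *settings)
```

No run sits at a corner of the factor space (all three factors at an extreme simultaneously), which is the property that distinguishes this design from a full factorial.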

Step-by-Step Workflow

1. Preview the design

$ doe info --config use_cases/84_feature_flag_evaluation/config.json

2. Generate the runner script

$ doe generate --config use_cases/84_feature_flag_evaluation/config.json \
    --output use_cases/84_feature_flag_evaluation/results/run.sh --seed 42

3. Execute the experiments

$ bash use_cases/84_feature_flag_evaluation/results/run.sh

4. Analyze results

$ doe analyze --config use_cases/84_feature_flag_evaluation/config.json

5. Get optimization recommendations

$ doe optimize --config use_cases/84_feature_flag_evaluation/config.json

6. Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

$ doe optimize --config use_cases/84_feature_flag_evaluation/config.json --multi

7. Generate the HTML report

$ doe report --config use_cases/84_feature_flag_evaluation/config.json \
    --output use_cases/84_feature_flag_evaluation/results/report.html

Features Exercised

Feature	Value
Design type	`box_behnken`
Factor types	`continuous` (all 3)
Arg style	`double-dash`
Responses	2 (evaluation_latency_us ↓, cache_hit_rate_pct ↑)
Total runs	15

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: evaluation_latency_us

Top factors: sdk_polling_interval_sec (48.8%), cache_ttl_sec (32.2%), rule_complexity_score (19.0%).
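The percentages are consistent with a simple calculation: each factor's effect is the spread between its largest and smallest level mean, and its contribution is that effect divided by the sum of all effects. This reading is inferred from the reported numbers, not from the tool's documentation. The sketch below uses the level means listed under Summary Statistics in the Full Analysis Output; the function names are hypothetical.

```python
# Level means for evaluation_latency_us, copied from the Summary Statistics
# in the Full Analysis Output.
level_means = {
    "cache_ttl_sec": {"5": 198.5000, "62.5": 177.4286, "120": 149.0000},
    "rule_complexity_score": {"1": 158.5000, "5.5": 187.7143, "10": 171.0000},
    "sdk_polling_interval_sec": {"10": 171.2500, "155": 149.7143, "300": 224.7500},
}

def main_effects(means_by_factor):
    """Effect = largest level mean minus smallest level mean, per factor."""
    return {
        factor: max(means.values()) - min(means.values())
        for factor, means in means_by_factor.items()
    }

def contributions(effects):
    """Each effect as a percentage of the sum of all effects."""
    total = sum(effects.values())
    return {factor: 100.0 * effect / total for factor, effect in effects.items()}

effects = main_effects(level_means)
shares = contributions(effects)
for name in sorted(shares, key=shares.get, reverse=True):
    print(f"{name}: effect={effects[name]:.4f} contribution={shares[name]:.1f}%")
```

Because the shares always sum to 100%, a large percentage indicates relative ranking only; whether an effect is distinguishable from noise is a separate question answered by the ANOVA p-values.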

ANOVA

Source	DF	SS	MS	F	p-value
cache_ttl_sec	2	4951.0190	2475.5095	1.001	0.4093
rule_complexity_score	2	2281.3048	1140.6524	0.461	0.6463
sdk_polling_interval_sec	2	14428.8048	7214.4024	2.917	0.1118
Lack of Fit	6	11376.6048	1896.1008	0.767	0.6614
Pure Error	2	4946.0000	2473.0000
Error	8	16322.6048	2473.0000
Total	14	37983.7333	2713.1238
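The MS and F columns follow from the DF and SS columns by simple arithmetic: MS = SS / DF, and F = MS / MS_error. The reported F values match when the pure-error mean square (4946.0 / 2 = 2473.0) is used as the denominator; that is an observation about these numbers, not a documented property of the tool. The sketch below reproduces the ratios. P-values are omitted because they need an F-distribution CDF, which the Python standard library does not provide.

```python
# Rows copied from the evaluation_latency_us ANOVA table: (source, DF, SS).
rows = [
    ("cache_ttl_sec", 2, 4951.0190),
    ("rule_complexity_score", 2, 2281.3048),
    ("sdk_polling_interval_sec", 2, 14428.8048),
    ("Lack of Fit", 6, 11376.6048),
]
pure_error_df, pure_error_ss = 2, 4946.0000

def f_ratios(anova_rows, error_df, error_ss):
    """Return {source: (MS, F)} with MS = SS / DF and F = MS / MS_error."""
    ms_error = error_ss / error_df
    result = {}
    for name, df, ss in anova_rows:
        ms = ss / df
        result[name] = (ms, ms / ms_error)
    return result

table = f_ratios(rows, pure_error_df, pure_error_ss)
for name, (ms, f_value) in table.items():
    print(f"{name:<26} MS={ms:>10.4f} F={f_value:.3f}")
```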

Figures for evaluation_latency_us: Pareto chart, main effects plot, normal probability plot of effects, half-normal plot of effects, and model diagnostics.

Response: cache_hit_rate_pct

Top factors: sdk_polling_interval_sec (58.1%), rule_complexity_score (30.1%), cache_ttl_sec (11.8%).

ANOVA

Source	DF	SS	MS	F	p-value
cache_ttl_sec	2	20.6422	10.3211	0.275	0.7661
rule_complexity_score	2	164.3722	82.1861	2.193	0.1740
sdk_polling_interval_sec	2	635.2204	317.6102	8.476	0.0106
Lack of Fit	6	619.5691	103.2615	2.756	0.2900
Pure Error	2	74.9400	37.4700
Error	8	694.5091	37.4700
Total	14	1514.7440	108.1960

Figures for cache_hit_rate_pct: Pareto chart, main effects plot, normal probability plot of effects, half-normal plot of effects, and model diagnostics.

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

cache_hit_rate_pct: cache_ttl_sec vs rule_complexity_score

cache_hit_rate_pct: cache_ttl_sec vs sdk_polling_interval_sec

cache_hit_rate_pct: rule_complexity_score vs sdk_polling_interval_sec

evaluation_latency_us: cache_ttl_sec vs rule_complexity_score

evaluation_latency_us: cache_ttl_sec vs sdk_polling_interval_sec

evaluation_latency_us: rule_complexity_score vs sdk_polling_interval_sec

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability

D = 0.9796

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`evaluation_latency_us`	1.0	0.9529	115.32 us	↓
`cache_hit_rate_pct`	1.5	0.9978	92.14 %	↑
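The overall desirability can be checked by hand from the per-response values and weights in the table: a weighted geometric mean is the product of each desirability raised to its weight, taken to the power 1 / (sum of weights). The sketch below implements that combination together with simple linear one-sided scaling functions. The scaling bounds the tool actually uses are not given in this report, so `desirability_minimize` and `desirability_maximize` are hypothetical helpers included only to show the general shape.

```python
import math

def desirability_minimize(y, target, upper):
    """Smaller-is-better: 1 at or below target, 0 at or above upper."""
    if y <= target:
        return 1.0
    if y >= upper:
        return 0.0
    return (upper - y) / (upper - target)

def desirability_maximize(y, lower, target):
    """Larger-is-better: 0 at or below lower, 1 at or above target."""
    if y >= target:
        return 1.0
    if y <= lower:
        return 0.0
    return (y - lower) / (target - lower)

def overall_desirability(d_values, weights):
    """Weighted geometric mean: exp(sum(w_i * ln d_i) / sum(w_i))."""
    if any(d <= 0.0 for d in d_values):
        return 0.0  # any fully undesirable response zeroes the overall score
    log_sum = sum(w * math.log(d) for d, w in zip(d_values, weights))
    return math.exp(log_sum / sum(weights))

# Desirabilities and weights from the Per-Response Desirability table.
overall = overall_desirability([0.9529, 0.9978], [1.0, 1.5])
print(f"D = {overall:.4f}")
```

With the table's inputs this agrees with the reported D = 0.9796 to four decimals. The geometric mean is used rather than an arithmetic one so that a single unacceptable response drives the overall score to zero.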

Recommended Settings

Factor	Value
`cache_ttl_sec`	120 sec
`rule_complexity_score`	2.8 score
`sdk_polling_interval_sec`	300 sec

Source: from RSM model prediction

Trade-off Summary

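Sacrifice is the gap between the predicted value at the compromise settings and the best observed value for that response; a positive value is worse than the best observed run, a negative value is better.

Response	Predicted	Best Observed	Sacrifice
`evaluation_latency_us`	115.32	115.00	+0.32
`cache_hit_rate_pct`	92.14	90.60	-1.54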

Top 3 Runs by Desirability

Run	D	Factor Settings
#8	0.9545	cache_ttl_sec=120, rule_complexity_score=1, sdk_polling_interval_sec=155
#4	0.8940	cache_ttl_sec=120, rule_complexity_score=5.5, sdk_polling_interval_sec=300
#1	0.8861	cache_ttl_sec=62.5, rule_complexity_score=10, sdk_polling_interval_sec=300

Model Quality

Response	R²	Type
`evaluation_latency_us`	0.3533	linear
`cache_hit_rate_pct`	0.8056	quadratic

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.9796

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
evaluation_latency_us        1.0       0.9529      115.32 us   ↓
cache_hit_rate_pct           1.5       0.9978       92.14 %   ↑

Recommended settings:
  cache_ttl_sec = 120 sec
  rule_complexity_score = 2.8 score
  sdk_polling_interval_sec = 300 sec
  (from RSM model prediction)

Trade-off summary:
  evaluation_latency_us: 115.32 (best observed: 115.00, sacrifice: +0.32)
  cache_hit_rate_pct: 92.14 (best observed: 90.60, sacrifice: -1.54)

Model quality:
  evaluation_latency_us: R² = 0.3533 (linear)
  cache_hit_rate_pct: R² = 0.8056 (quadratic)

Top 3 observed runs by overall desirability:
  1. Run #8 (D=0.9545): cache_ttl_sec=120, rule_complexity_score=1, sdk_polling_interval_sec=155
  2. Run #4 (D=0.8940): cache_ttl_sec=120, rule_complexity_score=5.5, sdk_polling_interval_sec=300
  3. Run #1 (D=0.8861): cache_ttl_sec=62.5, rule_complexity_score=10, sdk_polling_interval_sec=300

Full Analysis Output

doe analyze
=== Main Effects: evaluation_latency_us ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
sdk_polling_interval_sec    75.0357      13.4490            48.8%
cache_ttl_sec           49.5000      13.4490            32.2%
rule_complexity_score    29.2143      13.4490            19.0%

=== ANOVA Table: evaluation_latency_us ===
Source                      DF           SS           MS          F    p-value
-----------------------------------------------------------------------------
cache_ttl_sec                2    4951.0190    2475.5095      1.001     0.4093
rule_complexity_score        2    2281.3048    1140.6524      0.461     0.6463
sdk_polling_interval_sec     2   14428.8048    7214.4024      2.917     0.1118
Lack of Fit                  6   11376.6048    1896.1008      0.767     0.6614
Pure Error                   2    4946.0000    2473.0000
Error                        8   16322.6048    2473.0000
Total                       14   37983.7333    2713.1238

=== Summary Statistics: evaluation_latency_us ===

cache_ttl_sec:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  120                 4   149.0000    21.1818   125.0000   174.0000
  5                   4   198.5000    81.4391   115.0000   295.0000
  62.5                7   177.4286    44.3278   123.0000   232.0000

rule_complexity_score:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1                   4   158.5000    28.8964   129.0000   198.0000
  10                  4   171.0000    53.2103   115.0000   232.0000
  5.5                 7   187.7143    64.1657   123.0000   295.0000

sdk_polling_interval_sec:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  10                  4   171.2500    53.3065   125.0000   234.0000
  155                 7   149.7143    33.5446   115.0000   218.0000
  300                 4   224.7500    52.5317   174.0000   295.0000

=== Main Effects: cache_hit_rate_pct ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
sdk_polling_interval_sec    15.5893       2.6857            58.1%
rule_complexity_score     8.0750       2.6857            30.1%
cache_ttl_sec            3.1750       2.6857            11.8%

=== ANOVA Table: cache_hit_rate_pct ===
Source                      DF           SS           MS          F    p-value
-----------------------------------------------------------------------------
cache_ttl_sec                2      20.6422      10.3211      0.275     0.7661
rule_complexity_score        2     164.3722      82.1861      2.193     0.1740
sdk_polling_interval_sec     2     635.2204     317.6102      8.476     0.0106
Lack of Fit                  6     619.5691     103.2615      2.756     0.2900
Pure Error                   2      74.9400      37.4700
Error                        8     694.5091      37.4700
Total                       14    1514.7440     108.1960

=== Summary Statistics: cache_hit_rate_pct ===

cache_ttl_sec:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  120                 4    78.0000     9.1706    65.9000    88.2000
  5                   4    74.8250    13.9170    58.1000    90.6000
  62.5                7    76.7714    10.4941    59.1000    87.6000

rule_complexity_score:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1                   4    82.0250     3.8003    79.1000    87.6000
  10                  4    73.9500    13.7294    59.1000    90.6000
  5.5                 7    74.9714    11.1172    58.1000    88.2000

sdk_polling_interval_sec:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  10                  4    78.3250    11.1222    67.3000    88.2000
  155                 7    81.6143     5.3757    75.5000    90.6000
  300                 4    66.0250    10.5677    58.1000    81.0000

Optimization Recommendations

doe optimize
=== Optimization: evaluation_latency_us ===
Direction: minimize

Best observed run: #8
  cache_ttl_sec = 62.5
  rule_complexity_score = 5.5
  sdk_polling_interval_sec = 155
  Value: 115.0

RSM Model (linear, R² = 0.1396, Adj R² = -0.0950):
  Coefficients:
    intercept                      +175.4667
    cache_ttl_sec                  -5.3750
    rule_complexity_score          +5.2500
    sdk_polling_interval_sec       -24.6250

RSM Model (quadratic, R² = 0.3454, Adj R² = -0.8329):
  Coefficients:
    intercept                      +158.6667
    cache_ttl_sec                  -5.3750
    rule_complexity_score          +5.2500
    sdk_polling_interval_sec       -24.6250
    cache_ttl_sec*rule_complexity_score +2.7500
    cache_ttl_sec*sdk_polling_interval_sec -15.5000
    rule_complexity_score*sdk_polling_interval_sec -10.7500
    cache_ttl_sec^2                -16.5833
    rule_complexity_score^2        +34.6667
    sdk_polling_interval_sec^2     +13.4167

  Curvature analysis:
    rule_complexity_score          coef=+34.6667  convex (has a minimum)
    cache_ttl_sec                  coef=-16.5833  concave (has a maximum)
    sdk_polling_interval_sec       coef=+13.4167  convex (has a minimum)

  Notable interactions:
    cache_ttl_sec*sdk_polling_interval_sec coef=-15.5000  (antagonistic)
    rule_complexity_score*sdk_polling_interval_sec coef=-10.7500  (antagonistic)
    cache_ttl_sec*rule_complexity_score coef=+2.7500  (synergistic)

  Predicted optimum (from linear model, at observed points):
    cache_ttl_sec = 5
    rule_complexity_score = 5.5
    sdk_polling_interval_sec = 10
    Predicted value: 205.4667

  Surface optimum (via L-BFGS-B, linear model):
    cache_ttl_sec = 120
    rule_complexity_score = 1
    sdk_polling_interval_sec = 300
    Predicted value: 140.2167

  Model quality: Weak fit — consider adding center points or using a different design.

Factor importance:
  1. sdk_polling_interval_sec  (effect: 49.2, contribution: 42.9%)
  2. rule_complexity_score  (effect: 40.1, contribution: 35.0%)
  3. cache_ttl_sec  (effect: 25.4, contribution: 22.1%)

=== Optimization: cache_hit_rate_pct ===
Direction: maximize

Best observed run: #8
  cache_ttl_sec = 62.5
  rule_complexity_score = 5.5
  sdk_polling_interval_sec = 155
  Value: 90.6

RSM Model (linear, R² = 0.2346, Adj R² = 0.0258):
  Coefficients:
    intercept                      +76.5800
    cache_ttl_sec                  +1.8625
    rule_complexity_score          +0.7875
    sdk_polling_interval_sec       +6.3500

RSM Model (quadratic, R² = 0.4849, Adj R² = -0.4422):
  Coefficients:
    intercept                      +79.1000
    cache_ttl_sec                  +1.8625
    rule_complexity_score          +0.7875
    sdk_polling_interval_sec       +6.3500
    cache_ttl_sec*rule_complexity_score -0.7500
    cache_ttl_sec*sdk_polling_interval_sec +4.9750
    rule_complexity_score*sdk_polling_interval_sec +1.1250
    cache_ttl_sec^2                +4.8000
    rule_complexity_score^2        -3.9500
    sdk_polling_interval_sec^2     -5.5750

  Curvature analysis:
    sdk_polling_interval_sec       coef=-5.5750  concave (has a maximum)
    cache_ttl_sec                  coef=+4.8000  convex (has a minimum)
    rule_complexity_score          coef=-3.9500  concave (has a maximum)

  Notable interactions:
    cache_ttl_sec*sdk_polling_interval_sec coef=+4.9750  (synergistic)
    rule_complexity_score*sdk_polling_interval_sec coef=+1.1250  (synergistic)
    cache_ttl_sec*rule_complexity_score coef=-0.7500  (antagonistic)

  Predicted optimum (from linear model, at observed points):
    cache_ttl_sec = 120
    rule_complexity_score = 5.5
    sdk_polling_interval_sec = 300
    Predicted value: 84.7925

  Surface optimum (via L-BFGS-B, linear model):
    cache_ttl_sec = 120
    rule_complexity_score = 10
    sdk_polling_interval_sec = 300
    Predicted value: 85.5800

  Model quality: Weak fit — consider adding center points or using a different design.

Factor importance:
  1. sdk_polling_interval_sec  (effect: 12.7, contribution: 51.4%)
  2. cache_ttl_sec  (effect: 7.3, contribution: 29.7%)
  3. rule_complexity_score  (effect: 4.7, contribution: 18.9%)
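Two numbers in the optimization output above can be checked with a few lines of arithmetic. First, the model coefficients appear to be expressed in coded units, where each factor is rescaled so that low = -1, midpoint = 0, and high = +1; under that assumption the linear model for evaluation_latency_us evaluated at the reported surface optimum gives the reported predicted value. Second, the adjusted R² values follow from R², the 15 runs, and the number of model terms. Both readings are inferred from the figures in the output, and the helper names are hypothetical.

```python
RANGES = {
    "cache_ttl_sec": (5, 120),
    "rule_complexity_score": (1, 10),
    "sdk_polling_interval_sec": (10, 300),
}

# Linear-model coefficients for evaluation_latency_us, copied from the output above.
LATENCY_LINEAR = {
    "intercept": 175.4667,
    "cache_ttl_sec": -5.3750,
    "rule_complexity_score": 5.2500,
    "sdk_polling_interval_sec": -24.6250,
}

def to_coded(name, value):
    """Rescale an actual factor value to the coded -1..+1 range."""
    low, high = RANGES[name]
    return (value - (low + high) / 2) / ((high - low) / 2)

def predict_linear(coefs, settings):
    """Evaluate intercept + sum(coef * coded value) at the given actual settings."""
    return coefs["intercept"] + sum(
        coefs[name] * to_coded(name, value) for name, value in settings.items()
    )

def adjusted_r2(r2, n_runs, n_terms):
    """Adjusted R^2; n_terms counts model terms excluding the intercept."""
    return 1 - (1 - r2) * (n_runs - 1) / (n_runs - n_terms - 1)

surface_optimum = {
    "cache_ttl_sec": 120,
    "rule_complexity_score": 1,
    "sdk_polling_interval_sec": 300,
}
predicted = predict_linear(LATENCY_LINEAR, surface_optimum)
print(f"predicted latency: {predicted:.4f}")
print(f"adj R2 (linear):    {adjusted_r2(0.1396, 15, 3):.4f}")
print(f"adj R2 (quadratic): {adjusted_r2(0.3454, 15, 9):.4f}")
```

A negative adjusted R² means the model explains less variance than its term count would be expected to absorb by chance, which is consistent with the "Weak fit" message in the output.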