← All Use Cases
📊
Full Factorial Design

Data Lake Partitioning

Full factorial of partitioning strategy, file format, and compaction for query speed and storage cost

Summary

This experiment investigates data lake partitioning. Full factorial of partitioning strategy, file format, and compaction for query speed and storage cost.

The design varies 4 factors: partition cols, ranging from date to date_hour_region, file format, ranging from parquet to orc, target file mb (MB), ranging from 64 to 256, and z order, ranging from off to on. The goal is to optimize 2 responses: query time s (s) (minimize) and storage cost month (USD) (minimize). Fixed conditions held constant across all runs include table size tb = 5, retention days = 90.

A full factorial design was used to explore all 16 possible combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (24 runs).

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.

Key Findings

For query time s, the most influential factors were partition cols (57.1%), file format (16.8%), z order (15.0%). The best observed value was 2.9 (at partition cols = date_hour, file format = orc, target file mb = 64).

For storage cost month, the most influential factors were partition cols (54.4%), file format (22.3%), z order (18.9%). The best observed value was 182.0 (at partition cols = date, file format = parquet, target file mb = 64).

Recommended Next Steps

Experimental Setup

Factors

FactorLowHighUnit
partition_colsdatedate_hour_region
file_formatparquetorc
target_file_mb64256MB
z_orderoffon

Fixed: table_size_tb = 5, retention_days = 90

Responses

ResponseDirectionUnit
query_time_s↓ minimizes
storage_cost_month↓ minimizeUSD

Configuration

use_cases/38_data_lake_partitioning/config.json
{ "metadata": { "name": "Data Lake Partitioning", "description": "Full factorial of partitioning strategy, file format, and compaction for query speed and storage cost" }, "factors": [ { "name": "partition_cols", "levels": [ "date", "date_hour", "date_hour_region" ], "type": "categorical", "unit": "" }, { "name": "file_format", "levels": [ "parquet", "orc" ], "type": "categorical", "unit": "" }, { "name": "target_file_mb", "levels": [ "64", "256" ], "type": "continuous", "unit": "MB" }, { "name": "z_order", "levels": [ "off", "on" ], "type": "categorical", "unit": "" } ], "fixed_factors": { "table_size_tb": "5", "retention_days": "90" }, "responses": [ { "name": "query_time_s", "optimize": "minimize", "unit": "s" }, { "name": "storage_cost_month", "optimize": "minimize", "unit": "USD" } ], "settings": { "operation": "full_factorial", "test_script": "use_cases/38_data_lake_partitioning/sim.sh" } }

Experimental Matrix

The Full Factorial Design produces 24 runs. Each row is one experiment with specific factor settings.

Runpartition_colsfile_formattarget_file_mbz_order
1date_hourorc256on
2dateorc64on
3date_hour_regionparquet256off
4date_hour_regionorc256on
5date_hourorc64on
6date_hourparquet256off
7date_hour_regionparquet64on
8date_hourparquet256on
9date_hourorc256off
10date_hour_regionparquet64off
11dateparquet64on
12date_hourorc64off
13date_hour_regionorc64on
14dateorc256off
15date_hourparquet64on
16dateparquet256off
17date_hour_regionorc256off
18dateorc64off
19date_hour_regionparquet256on
20dateorc256on
21date_hourparquet64off
22dateparquet64off
23dateparquet256on
24date_hour_regionorc64off

Step-by-Step Workflow

1

Preview the design

Terminal
$ doe info --config use_cases/38_data_lake_partitioning/config.json
2

Generate the runner script

Terminal
$ doe generate --config use_cases/38_data_lake_partitioning/config.json \ --output use_cases/38_data_lake_partitioning/results/run.sh --seed 42
3

Execute the experiments

Terminal
$ bash use_cases/38_data_lake_partitioning/results/run.sh
4

Analyze results

Terminal
$ doe analyze --config use_cases/38_data_lake_partitioning/config.json
5

Get optimization recommendations

Terminal
$ doe optimize --config use_cases/38_data_lake_partitioning/config.json
6

Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

Terminal
$ doe optimize --config use_cases/38_data_lake_partitioning/config.json --multi
7

Generate the HTML report

Terminal
$ doe report --config use_cases/38_data_lake_partitioning/config.json \ --output use_cases/38_data_lake_partitioning/results/report.html

Features Exercised

FeatureValue
Design typefull_factorial
Factor typescontinuous (1), categorical (3)
Arg styledouble-dash
Responses2 (query_time_s ↓, storage_cost_month ↓)
Total runs24

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: query_time_s

Top factors: partition_cols (57.1%), file_format (16.8%), z_order (15.0%).

ANOVA

SourceDFSSMSFp-value
SourceDFSSMSFp-value
partition_cols2114.077557.03881.9760.1731
file_format113.500013.50000.4680.5045
target_file_mb15.80175.80170.2010.6603
z_order110.666710.66670.3700.5523
file_format*target_file_mb132.201732.20171.1160.3076
file_format*z_order164.026764.02672.2180.1571
target_file_mb*z_order10.20170.20170.0070.9345
Error15432.924228.8616
Total23673.400029.2783

Pareto Chart

Pareto chart for query_time_s

Main Effects Plot

Main effects plot for query_time_s

Normal Probability Plot of Effects

Normal probability plot for query_time_s

Half-Normal Plot of Effects

Half-normal plot for query_time_s

Model Diagnostics

Model diagnostics for query_time_s

Response: storage_cost_month

Top factors: partition_cols (54.4%), file_format (22.3%), z_order (18.9%).

ANOVA

SourceDFSSMSFp-value
SourceDFSSMSFp-value
partition_cols24073.25002036.62501.9110.1823
file_format11027.04171027.04170.9630.3419
target_file_mb140.041740.04170.0380.8489
z_order1737.0417737.04170.6910.4187
file_format*target_file_mb122.041722.04170.0210.8876
file_format*z_order12.04172.04170.0020.9657
target_file_mb*z_order11.04171.04170.0010.9755
Error1515990.12501066.0083
Total2321892.6250951.8533

Pareto Chart

Pareto chart for storage_cost_month

Main Effects Plot

Main effects plot for storage_cost_month

Normal Probability Plot of Effects

Normal probability plot for storage_cost_month

Half-Normal Plot of Effects

Half-normal plot for storage_cost_month

Model Diagnostics

Model diagnostics for storage_cost_month

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.7734

Per-Response Desirability

ResponseWeightDesirabilityPredictedDir
query_time_s 1.5
0.6721
9.30 0.6721 9.30 s
storage_cost_month 1.0
0.9545
182.00 0.9545 182.00 USD

Recommended Settings

FactorValue
partition_colsdate_hour_region
file_formatorc
target_file_mb64 MB
z_orderon

Source: from observed run #23

Trade-off Summary

Sacrifice = how much worse than single-objective best.

ResponsePredictedBest ObservedSacrifice
storage_cost_month182.00182.00+0.00

Top 3 Runs by Desirability

RunDFactor Settings
#200.7635partition_cols=date_hour_region, file_format=orc, target_file_mb=256, z_order=on
#10.7469partition_cols=date, file_format=orc, target_file_mb=64, z_order=on

Model Quality

ResponseType
storage_cost_month0.2973linear

Full Multi-Objective Output

doe optimize --multi
============================================================ MULTI-OBJECTIVE OPTIMIZATION Method: Derringer-Suich Desirability Function ============================================================ Overall desirability: D = 0.7734 Response Weight Desirability Predicted Direction --------------------------------------------------------------------- query_time_s 1.5 0.6721 9.30 s ↓ storage_cost_month 1.0 0.9545 182.00 USD ↓ Recommended settings: partition_cols = date_hour_region file_format = orc target_file_mb = 64 MB z_order = on (from observed run #23) Trade-off summary: query_time_s: 9.30 (best observed: 2.90, sacrifice: +6.40) storage_cost_month: 182.00 (best observed: 182.00, sacrifice: +0.00) Model quality: query_time_s: R² = 0.1417 (linear) storage_cost_month: R² = 0.2973 (linear) Top 3 observed runs by overall desirability: 1. Run #23 (D=0.7734): partition_cols=date_hour_region, file_format=orc, target_file_mb=64, z_order=on 2. Run #20 (D=0.7635): partition_cols=date_hour_region, file_format=orc, target_file_mb=256, z_order=on 3. Run #1 (D=0.7469): partition_cols=date, file_format=orc, target_file_mb=64, z_order=on

Full Analysis Output

doe analyze
=== Main Effects: query_time_s === Factor Effect Std Error % Contribution -------------------------------------------------------------- partition_cols 5.0875 1.1045 57.1% file_format 1.5000 1.1045 16.8% z_order -1.3333 1.1045 15.0% target_file_mb 0.9833 1.1045 11.0% === ANOVA Table: query_time_s === Source DF SS MS F p-value ----------------------------------------------------------------------------- partition_cols 2 114.0775 57.0388 1.976 0.1731 file_format 1 13.5000 13.5000 0.468 0.5045 target_file_mb 1 5.8017 5.8017 0.201 0.6603 z_order 1 10.6667 10.6667 0.370 0.5523 file_format*target_file_mb 1 32.2017 32.2017 1.116 0.3076 file_format*z_order 1 64.0267 64.0267 2.218 0.1571 target_file_mb*z_order 1 0.2017 0.2017 0.007 0.9345 Error 15 432.9242 28.8616 Total 23 673.4000 29.2783 === Interaction Effects: query_time_s === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ file_format z_order -3.2667 56.6% file_format target_file_mb 2.3167 40.2% target_file_mb z_order -0.1833 3.2% === Summary Statistics: query_time_s === partition_cols: Level N Mean Std Min Max ------------------------------------------------------------ date 8 13.1375 5.9969 5.5000 23.5000 date_hour 8 14.2750 3.9326 8.6000 20.3000 date_hour_region 8 9.1875 5.3362 2.9000 20.5000 file_format: Level N Mean Std Min Max ------------------------------------------------------------ orc 12 11.4500 5.8520 2.9000 23.5000 parquet 12 12.9500 5.0739 4.9000 20.5000 target_file_mb: Level N Mean Std Min Max ------------------------------------------------------------ 256 12 11.7083 4.9471 2.9000 20.3000 64 12 12.6917 6.0181 4.9000 23.5000 z_order: Level N Mean Std Min Max ------------------------------------------------------------ off 12 12.8667 4.2455 6.1000 20.5000 on 12 11.5333 6.4980 2.9000 23.5000 === Main Effects: storage_cost_month === Factor Effect Std Error % Contribution -------------------------------------------------------------- partition_cols 31.8750 6.2977 54.4% file_format 13.0833 6.2977 22.3% z_order -11.0833 6.2977 18.9% target_file_mb 2.5833 6.2977 4.4% === ANOVA Table: storage_cost_month === Source DF SS MS F p-value ----------------------------------------------------------------------------- partition_cols 2 4073.2500 2036.6250 1.911 0.1823 file_format 1 1027.0417 1027.0417 0.963 0.3419 target_file_mb 1 40.0417 40.0417 0.038 0.8489 z_order 1 737.0417 737.0417 0.691 0.4187 file_format*target_file_mb 1 22.0417 22.0417 0.021 0.8876 file_format*z_order 1 2.0417 2.0417 0.002 0.9657 target_file_mb*z_order 1 1.0417 1.0417 0.001 0.9755 Error 15 15990.1250 1066.0083 Total 23 21892.6250 951.8533 === Interaction Effects: storage_cost_month === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ file_format target_file_mb -1.9167 65.7% file_format z_order -0.5833 20.0% target_file_mb z_order -0.4167 14.3% === Summary Statistics: storage_cost_month === partition_cols: Level N Mean Std Min Max ------------------------------------------------------------ date 8 264.6250 25.0196 237.0000 312.0000 date_hour 8 232.7500 38.9129 182.0000 281.0000 date_hour_region 8 250.0000 20.1353 215.0000 271.0000 file_format: Level N Mean Std Min Max ------------------------------------------------------------ orc 12 242.5833 37.6430 182.0000 312.0000 parquet 12 255.6667 21.9061 220.0000 283.0000 target_file_mb: Level N Mean Std Min Max ------------------------------------------------------------ 256 12 247.8333 26.4638 185.0000 273.0000 64 12 250.4167 35.8645 182.0000 312.0000 z_order: Level N Mean Std Min Max ------------------------------------------------------------ off 12 254.6667 31.0698 182.0000 312.0000 on 12 243.5833 30.9500 185.0000 283.0000

Optimization Recommendations

doe optimize
=== Optimization: query_time_s === Direction: minimize Best observed run: #4 partition_cols = date_hour file_format = orc target_file_mb = 64 z_order = on Value: 2.9 RSM Model (linear, R² = 0.0752, Adj R² = -0.1195): Coefficients: intercept +12.2000 partition_cols +1.2000 file_format +0.9833 target_file_mb -0.3917 z_order -0.1750 RSM Model (quadratic, R² = 0.3533, Adj R² = -0.6526): Coefficients: intercept +2.7750 partition_cols +1.2000 file_format +0.9833 target_file_mb -0.3917 z_order -0.1750 partition_cols*file_format -0.0750 partition_cols*target_file_mb +0.4375 partition_cols*z_order +0.4875 file_format*target_file_mb +1.3083 file_format*z_order +2.1583 target_file_mb*z_order +0.7333 partition_cols^2 +1.6500 file_format^2 +2.7750 target_file_mb^2 +2.7750 z_order^2 +2.7750 Curvature analysis: target_file_mb coef=+2.7750 convex (has a minimum) z_order coef=+2.7750 convex (has a minimum) file_format coef=+2.7750 convex (has a minimum) partition_cols coef=+1.6500 convex (has a minimum) Notable interactions: file_format*z_order coef=+2.1583 (synergistic) file_format*target_file_mb coef=+1.3083 (synergistic) target_file_mb*z_order coef=+0.7333 (synergistic) partition_cols*z_order coef=+0.4875 (synergistic) partition_cols*target_file_mb coef=+0.4375 (synergistic) Predicted optimum (from linear model, at observed points): partition_cols = date_hour_region file_format = parquet target_file_mb = 64 z_order = off Predicted value: 14.9500 Surface optimum (via L-BFGS-B, linear model): partition_cols = date file_format = parquet target_file_mb = 256 z_order = on Predicted value: 9.4500 Model quality: Weak fit — consider adding center points or using a different design. Factor importance: 1. partition_cols (effect: 2.8, contribution: 47.9%) 2. file_format (effect: 2.0, contribution: 33.1%) 3. target_file_mb (effect: 0.8, contribution: 13.2%) 4. z_order (effect: -0.3, contribution: 5.9%) === Optimization: storage_cost_month === Direction: minimize Best observed run: #23 partition_cols = date file_format = parquet target_file_mb = 64 z_order = off Value: 182.0 RSM Model (linear, R² = 0.3041, Adj R² = 0.1576): Coefficients: intercept +249.1250 partition_cols +3.7500 file_format -15.6250 target_file_mb -2.6250 z_order +4.1250 RSM Model (quadratic, R² = 0.5619, Adj R² = -0.1195): Coefficients: intercept +63.3438 partition_cols +3.7500 file_format -15.6250 target_file_mb -2.6250 z_order +4.1250 partition_cols*file_format +7.1250 partition_cols*target_file_mb -2.6250 partition_cols*z_order -5.0000 file_format*target_file_mb +3.4583 file_format*z_order -7.2917 target_file_mb*z_order -10.2917 partition_cols^2 -6.3750 file_format^2 +63.3438 target_file_mb^2 +63.3438 z_order^2 +63.3438 Curvature analysis: target_file_mb coef=+63.3438 convex (has a minimum) z_order coef=+63.3438 convex (has a minimum) file_format coef=+63.3438 convex (has a minimum) partition_cols coef=-6.3750 concave (has a maximum) Notable interactions: target_file_mb*z_order coef=-10.2917 (antagonistic) file_format*z_order coef=-7.2917 (antagonistic) partition_cols*file_format coef=+7.1250 (synergistic) partition_cols*z_order coef=-5.0000 (antagonistic) file_format*target_file_mb coef=+3.4583 (synergistic) partition_cols*target_file_mb coef=-2.6250 (antagonistic) Predicted optimum (from linear model, at observed points): partition_cols = date_hour_region file_format = orc target_file_mb = 64 z_order = on Predicted value: 275.2500 Surface optimum (via L-BFGS-B, linear model): partition_cols = date file_format = orc target_file_mb = 256 z_order = off Predicted value: 223.0000 Model quality: Weak fit — consider adding center points or using a different design. Factor importance: 1. file_format (effect: -31.2, contribution: 56.9%) 2. partition_cols (effect: 10.1, contribution: 18.5%) 3. z_order (effect: 8.2, contribution: 15.0%) 4. target_file_mb (effect: 5.2, contribution: 9.6%)
← Previous: Spark Shuffle Optimization Next: Stream Processing Windowing →