Summary
This experiment investigates data lake partitioning. Full factorial of partitioning strategy, file format, and compaction for query speed and storage cost.
The design varies 4 factors: partition cols, ranging from date to date_hour_region, file format, ranging from parquet to orc, target file mb (MB), ranging from 64 to 256, and z order, ranging from off to on. The goal is to optimize 2 responses: query time s (s) (minimize) and storage cost month (USD) (minimize). Fixed conditions held constant across all runs include table size tb = 5, retention days = 90.
A full factorial design was used to explore all 16 possible combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (24 runs).
Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.
Key Findings
For query time s, the most influential factors were partition cols (57.1%), file format (16.8%), z order (15.0%). The best observed value was 2.9 (at partition cols = date_hour, file format = orc, target file mb = 64).
For storage cost month, the most influential factors were partition cols (54.4%), file format (22.3%), z order (18.9%). The best observed value was 182.0 (at partition cols = date, file format = parquet, target file mb = 64).
Recommended Next Steps
- Consider whether any fixed factors should be varied in a future study.
Experimental Setup
Factors
| Factor | Low | High | Unit |
partition_cols | date | date_hour_region | |
file_format | parquet | orc | |
target_file_mb | 64 | 256 | MB |
z_order | off | on | |
Fixed: table_size_tb = 5, retention_days = 90
Responses
| Response | Direction | Unit |
query_time_s | ↓ minimize | s |
storage_cost_month | ↓ minimize | USD |
Configuration
{
"metadata": {
"name": "Data Lake Partitioning",
"description": "Full factorial of partitioning strategy, file format, and compaction for query speed and storage cost"
},
"factors": [
{
"name": "partition_cols",
"levels": [
"date",
"date_hour",
"date_hour_region"
],
"type": "categorical",
"unit": ""
},
{
"name": "file_format",
"levels": [
"parquet",
"orc"
],
"type": "categorical",
"unit": ""
},
{
"name": "target_file_mb",
"levels": [
"64",
"256"
],
"type": "continuous",
"unit": "MB"
},
{
"name": "z_order",
"levels": [
"off",
"on"
],
"type": "categorical",
"unit": ""
}
],
"fixed_factors": {
"table_size_tb": "5",
"retention_days": "90"
},
"responses": [
{
"name": "query_time_s",
"optimize": "minimize",
"unit": "s"
},
{
"name": "storage_cost_month",
"optimize": "minimize",
"unit": "USD"
}
],
"settings": {
"operation": "full_factorial",
"test_script": "use_cases/38_data_lake_partitioning/sim.sh"
}
}
Experimental Matrix
The Full Factorial Design produces 24 runs. Each row is one experiment with specific factor settings.
| Run | partition_cols | file_format | target_file_mb | z_order |
| 1 | date_hour | orc | 256 | on |
| 2 | date | orc | 64 | on |
| 3 | date_hour_region | parquet | 256 | off |
| 4 | date_hour_region | orc | 256 | on |
| 5 | date_hour | orc | 64 | on |
| 6 | date_hour | parquet | 256 | off |
| 7 | date_hour_region | parquet | 64 | on |
| 8 | date_hour | parquet | 256 | on |
| 9 | date_hour | orc | 256 | off |
| 10 | date_hour_region | parquet | 64 | off |
| 11 | date | parquet | 64 | on |
| 12 | date_hour | orc | 64 | off |
| 13 | date_hour_region | orc | 64 | on |
| 14 | date | orc | 256 | off |
| 15 | date_hour | parquet | 64 | on |
| 16 | date | parquet | 256 | off |
| 17 | date_hour_region | orc | 256 | off |
| 18 | date | orc | 64 | off |
| 19 | date_hour_region | parquet | 256 | on |
| 20 | date | orc | 256 | on |
| 21 | date_hour | parquet | 64 | off |
| 22 | date | parquet | 64 | off |
| 23 | date | parquet | 256 | on |
| 24 | date_hour_region | orc | 64 | off |
Step-by-Step Workflow
1
Preview the design
$ doe info --config use_cases/38_data_lake_partitioning/config.json
2
Generate the runner script
$ doe generate --config use_cases/38_data_lake_partitioning/config.json \
--output use_cases/38_data_lake_partitioning/results/run.sh --seed 42
3
Execute the experiments
$ bash use_cases/38_data_lake_partitioning/results/run.sh
4
Analyze results
$ doe analyze --config use_cases/38_data_lake_partitioning/config.json
5
Get optimization recommendations
$ doe optimize --config use_cases/38_data_lake_partitioning/config.json
6
Multi-objective optimization
With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.
$ doe optimize --config use_cases/38_data_lake_partitioning/config.json --multi
7
Generate the HTML report
$ doe report --config use_cases/38_data_lake_partitioning/config.json \
--output use_cases/38_data_lake_partitioning/results/report.html
Features Exercised
| Feature | Value |
| Design type | full_factorial |
| Factor types | continuous (1), categorical (3) |
| Arg style | double-dash |
| Responses | 2 (query_time_s ↓, storage_cost_month ↓) |
| Total runs | 24 |
Analysis Results
Generated from actual experiment runs using the DOE Helper Tool.
Response: query_time_s
Top factors: partition_cols (57.1%), file_format (16.8%), z_order (15.0%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| Source | DF | SS | MS | F | p-value |
| partition_cols | 2 | 114.0775 | 57.0388 | 1.976 | 0.1731 |
| file_format | 1 | 13.5000 | 13.5000 | 0.468 | 0.5045 |
| target_file_mb | 1 | 5.8017 | 5.8017 | 0.201 | 0.6603 |
| z_order | 1 | 10.6667 | 10.6667 | 0.370 | 0.5523 |
| file_format*target_file_mb | 1 | 32.2017 | 32.2017 | 1.116 | 0.3076 |
| file_format*z_order | 1 | 64.0267 | 64.0267 | 2.218 | 0.1571 |
| target_file_mb*z_order | 1 | 0.2017 | 0.2017 | 0.007 | 0.9345 |
| Error | 15 | 432.9242 | 28.8616 | | |
| Total | 23 | 673.4000 | 29.2783 | | |
Pareto Chart
Main Effects Plot
Normal Probability Plot of Effects
Half-Normal Plot of Effects
Model Diagnostics
Response: storage_cost_month
Top factors: partition_cols (54.4%), file_format (22.3%), z_order (18.9%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| Source | DF | SS | MS | F | p-value |
| partition_cols | 2 | 4073.2500 | 2036.6250 | 1.911 | 0.1823 |
| file_format | 1 | 1027.0417 | 1027.0417 | 0.963 | 0.3419 |
| target_file_mb | 1 | 40.0417 | 40.0417 | 0.038 | 0.8489 |
| z_order | 1 | 737.0417 | 737.0417 | 0.691 | 0.4187 |
| file_format*target_file_mb | 1 | 22.0417 | 22.0417 | 0.021 | 0.8876 |
| file_format*z_order | 1 | 2.0417 | 2.0417 | 0.002 | 0.9657 |
| target_file_mb*z_order | 1 | 1.0417 | 1.0417 | 0.001 | 0.9755 |
| Error | 15 | 15990.1250 | 1066.0083 | | |
| Total | 23 | 21892.6250 | 951.8533 | | |
Pareto Chart
Main Effects Plot
Normal Probability Plot of Effects
Half-Normal Plot of Effects
Model Diagnostics
Multi-Objective Optimization
When responses compete, Derringer–Suich desirability finds the best compromise.
Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
Overall Desirability
D = 0.7734
Per-Response Desirability
| Response | Weight | Desirability | Predicted | Dir |
query_time_s |
1.5 |
|
9.30 0.6721 9.30 s |
↓ |
storage_cost_month |
1.0 |
|
182.00 0.9545 182.00 USD |
↓ |
Recommended Settings
| Factor | Value |
partition_cols | date_hour_region |
file_format | orc |
target_file_mb | 64 MB |
z_order | on |
Source: from observed run #23
Trade-off Summary
Sacrifice = how much worse than single-objective best.
| Response | Predicted | Best Observed | Sacrifice |
storage_cost_month | 182.00 | 182.00 | +0.00 |
Top 3 Runs by Desirability
| Run | D | Factor Settings |
| #20 | 0.7635 | partition_cols=date_hour_region, file_format=orc, target_file_mb=256, z_order=on |
| #1 | 0.7469 | partition_cols=date, file_format=orc, target_file_mb=64, z_order=on |
Model Quality
| Response | R² | Type |
storage_cost_month | 0.2973 | linear |
Full Multi-Objective Output
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================
Overall desirability: D = 0.7734
Response Weight Desirability Predicted Direction
---------------------------------------------------------------------
query_time_s 1.5 0.6721 9.30 s ↓
storage_cost_month 1.0 0.9545 182.00 USD ↓
Recommended settings:
partition_cols = date_hour_region
file_format = orc
target_file_mb = 64 MB
z_order = on
(from observed run #23)
Trade-off summary:
query_time_s: 9.30 (best observed: 2.90, sacrifice: +6.40)
storage_cost_month: 182.00 (best observed: 182.00, sacrifice: +0.00)
Model quality:
query_time_s: R² = 0.1417 (linear)
storage_cost_month: R² = 0.2973 (linear)
Top 3 observed runs by overall desirability:
1. Run #23 (D=0.7734): partition_cols=date_hour_region, file_format=orc, target_file_mb=64, z_order=on
2. Run #20 (D=0.7635): partition_cols=date_hour_region, file_format=orc, target_file_mb=256, z_order=on
3. Run #1 (D=0.7469): partition_cols=date, file_format=orc, target_file_mb=64, z_order=on
Full Analysis Output
=== Main Effects: query_time_s ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
partition_cols 5.0875 1.1045 57.1%
file_format 1.5000 1.1045 16.8%
z_order -1.3333 1.1045 15.0%
target_file_mb 0.9833 1.1045 11.0%
=== ANOVA Table: query_time_s ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
partition_cols 2 114.0775 57.0388 1.976 0.1731
file_format 1 13.5000 13.5000 0.468 0.5045
target_file_mb 1 5.8017 5.8017 0.201 0.6603
z_order 1 10.6667 10.6667 0.370 0.5523
file_format*target_file_mb 1 32.2017 32.2017 1.116 0.3076
file_format*z_order 1 64.0267 64.0267 2.218 0.1571
target_file_mb*z_order 1 0.2017 0.2017 0.007 0.9345
Error 15 432.9242 28.8616
Total 23 673.4000 29.2783
=== Interaction Effects: query_time_s ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
file_format z_order -3.2667 56.6%
file_format target_file_mb 2.3167 40.2%
target_file_mb z_order -0.1833 3.2%
=== Summary Statistics: query_time_s ===
partition_cols:
Level N Mean Std Min Max
------------------------------------------------------------
date 8 13.1375 5.9969 5.5000 23.5000
date_hour 8 14.2750 3.9326 8.6000 20.3000
date_hour_region 8 9.1875 5.3362 2.9000 20.5000
file_format:
Level N Mean Std Min Max
------------------------------------------------------------
orc 12 11.4500 5.8520 2.9000 23.5000
parquet 12 12.9500 5.0739 4.9000 20.5000
target_file_mb:
Level N Mean Std Min Max
------------------------------------------------------------
256 12 11.7083 4.9471 2.9000 20.3000
64 12 12.6917 6.0181 4.9000 23.5000
z_order:
Level N Mean Std Min Max
------------------------------------------------------------
off 12 12.8667 4.2455 6.1000 20.5000
on 12 11.5333 6.4980 2.9000 23.5000
=== Main Effects: storage_cost_month ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
partition_cols 31.8750 6.2977 54.4%
file_format 13.0833 6.2977 22.3%
z_order -11.0833 6.2977 18.9%
target_file_mb 2.5833 6.2977 4.4%
=== ANOVA Table: storage_cost_month ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
partition_cols 2 4073.2500 2036.6250 1.911 0.1823
file_format 1 1027.0417 1027.0417 0.963 0.3419
target_file_mb 1 40.0417 40.0417 0.038 0.8489
z_order 1 737.0417 737.0417 0.691 0.4187
file_format*target_file_mb 1 22.0417 22.0417 0.021 0.8876
file_format*z_order 1 2.0417 2.0417 0.002 0.9657
target_file_mb*z_order 1 1.0417 1.0417 0.001 0.9755
Error 15 15990.1250 1066.0083
Total 23 21892.6250 951.8533
=== Interaction Effects: storage_cost_month ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
file_format target_file_mb -1.9167 65.7%
file_format z_order -0.5833 20.0%
target_file_mb z_order -0.4167 14.3%
=== Summary Statistics: storage_cost_month ===
partition_cols:
Level N Mean Std Min Max
------------------------------------------------------------
date 8 264.6250 25.0196 237.0000 312.0000
date_hour 8 232.7500 38.9129 182.0000 281.0000
date_hour_region 8 250.0000 20.1353 215.0000 271.0000
file_format:
Level N Mean Std Min Max
------------------------------------------------------------
orc 12 242.5833 37.6430 182.0000 312.0000
parquet 12 255.6667 21.9061 220.0000 283.0000
target_file_mb:
Level N Mean Std Min Max
------------------------------------------------------------
256 12 247.8333 26.4638 185.0000 273.0000
64 12 250.4167 35.8645 182.0000 312.0000
z_order:
Level N Mean Std Min Max
------------------------------------------------------------
off 12 254.6667 31.0698 182.0000 312.0000
on 12 243.5833 30.9500 185.0000 283.0000
Optimization Recommendations
=== Optimization: query_time_s ===
Direction: minimize
Best observed run: #4
partition_cols = date_hour
file_format = orc
target_file_mb = 64
z_order = on
Value: 2.9
RSM Model (linear, R² = 0.0752, Adj R² = -0.1195):
Coefficients:
intercept +12.2000
partition_cols +1.2000
file_format +0.9833
target_file_mb -0.3917
z_order -0.1750
RSM Model (quadratic, R² = 0.3533, Adj R² = -0.6526):
Coefficients:
intercept +2.7750
partition_cols +1.2000
file_format +0.9833
target_file_mb -0.3917
z_order -0.1750
partition_cols*file_format -0.0750
partition_cols*target_file_mb +0.4375
partition_cols*z_order +0.4875
file_format*target_file_mb +1.3083
file_format*z_order +2.1583
target_file_mb*z_order +0.7333
partition_cols^2 +1.6500
file_format^2 +2.7750
target_file_mb^2 +2.7750
z_order^2 +2.7750
Curvature analysis:
target_file_mb coef=+2.7750 convex (has a minimum)
z_order coef=+2.7750 convex (has a minimum)
file_format coef=+2.7750 convex (has a minimum)
partition_cols coef=+1.6500 convex (has a minimum)
Notable interactions:
file_format*z_order coef=+2.1583 (synergistic)
file_format*target_file_mb coef=+1.3083 (synergistic)
target_file_mb*z_order coef=+0.7333 (synergistic)
partition_cols*z_order coef=+0.4875 (synergistic)
partition_cols*target_file_mb coef=+0.4375 (synergistic)
Predicted optimum (from linear model, at observed points):
partition_cols = date_hour_region
file_format = parquet
target_file_mb = 64
z_order = off
Predicted value: 14.9500
Surface optimum (via L-BFGS-B, linear model):
partition_cols = date
file_format = parquet
target_file_mb = 256
z_order = on
Predicted value: 9.4500
Model quality: Weak fit — consider adding center points or using a different design.
Factor importance:
1. partition_cols (effect: 2.8, contribution: 47.9%)
2. file_format (effect: 2.0, contribution: 33.1%)
3. target_file_mb (effect: 0.8, contribution: 13.2%)
4. z_order (effect: -0.3, contribution: 5.9%)
=== Optimization: storage_cost_month ===
Direction: minimize
Best observed run: #23
partition_cols = date
file_format = parquet
target_file_mb = 64
z_order = off
Value: 182.0
RSM Model (linear, R² = 0.3041, Adj R² = 0.1576):
Coefficients:
intercept +249.1250
partition_cols +3.7500
file_format -15.6250
target_file_mb -2.6250
z_order +4.1250
RSM Model (quadratic, R² = 0.5619, Adj R² = -0.1195):
Coefficients:
intercept +63.3438
partition_cols +3.7500
file_format -15.6250
target_file_mb -2.6250
z_order +4.1250
partition_cols*file_format +7.1250
partition_cols*target_file_mb -2.6250
partition_cols*z_order -5.0000
file_format*target_file_mb +3.4583
file_format*z_order -7.2917
target_file_mb*z_order -10.2917
partition_cols^2 -6.3750
file_format^2 +63.3438
target_file_mb^2 +63.3438
z_order^2 +63.3438
Curvature analysis:
target_file_mb coef=+63.3438 convex (has a minimum)
z_order coef=+63.3438 convex (has a minimum)
file_format coef=+63.3438 convex (has a minimum)
partition_cols coef=-6.3750 concave (has a maximum)
Notable interactions:
target_file_mb*z_order coef=-10.2917 (antagonistic)
file_format*z_order coef=-7.2917 (antagonistic)
partition_cols*file_format coef=+7.1250 (synergistic)
partition_cols*z_order coef=-5.0000 (antagonistic)
file_format*target_file_mb coef=+3.4583 (synergistic)
partition_cols*target_file_mb coef=-2.6250 (antagonistic)
Predicted optimum (from linear model, at observed points):
partition_cols = date_hour_region
file_format = orc
target_file_mb = 64
z_order = on
Predicted value: 275.2500
Surface optimum (via L-BFGS-B, linear model):
partition_cols = date
file_format = orc
target_file_mb = 256
z_order = off
Predicted value: 223.0000
Model quality: Weak fit — consider adding center points or using a different design.
Factor importance:
1. file_format (effect: -31.2, contribution: 56.9%)
2. partition_cols (effect: 10.1, contribution: 18.5%)
3. z_order (effect: 8.2, contribution: 15.0%)
4. target_file_mb (effect: 5.2, contribution: 9.6%)