← All Use Cases
📊
Full Factorial Design

Columnar Compression

Full factorial of compression codec, dictionary encoding, page size, and row group size for ratio vs speed

Summary

This experiment investigates columnar compression. Full factorial of compression codec, dictionary encoding, page size, and row group size for ratio vs speed.

The design varies 4 factors: codec, ranging from snappy to zstd, dictionary, ranging from off to on, page size kb (KB), ranging from 64 to 1024, and row group mb (MB), ranging from 32 to 256. The goal is to optimize 2 responses: compression ratio (x) (maximize) and read speed mbps (MB/s) (maximize). Fixed conditions held constant across all runs include format = parquet, data type = mixed_analytics.

A full factorial design was used to explore all 16 possible combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (16 runs).

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.

Key Findings

For compression ratio, the most influential factors were dictionary (42.3%), codec (38.6%), row group mb (13.8%). The best observed value was 8.7 (at codec = snappy, dictionary = on, page size kb = 1024).

For read speed mbps, the most influential factors were dictionary (58.1%), row group mb (28.3%), page size kb (7.3%). The best observed value was 1323.0 (at codec = snappy, dictionary = on, page size kb = 64).

Recommended Next Steps

Experimental Setup

Factors

FactorLowHighUnit
codecsnappyzstd
dictionaryoffon
page_size_kb641024KB
row_group_mb32256MB

Fixed: format = parquet, data_type = mixed_analytics

Responses

ResponseDirectionUnit
compression_ratio↑ maximizex
read_speed_mbps↑ maximizeMB/s

Configuration

use_cases/41_columnar_compression/config.json
{ "metadata": { "name": "Columnar Compression", "description": "Full factorial of compression codec, dictionary encoding, page size, and row group size for ratio vs speed" }, "factors": [ { "name": "codec", "levels": [ "snappy", "zstd" ], "type": "categorical", "unit": "" }, { "name": "dictionary", "levels": [ "off", "on" ], "type": "categorical", "unit": "" }, { "name": "page_size_kb", "levels": [ "64", "1024" ], "type": "continuous", "unit": "KB" }, { "name": "row_group_mb", "levels": [ "32", "256" ], "type": "continuous", "unit": "MB" } ], "fixed_factors": { "format": "parquet", "data_type": "mixed_analytics" }, "responses": [ { "name": "compression_ratio", "optimize": "maximize", "unit": "x" }, { "name": "read_speed_mbps", "optimize": "maximize", "unit": "MB/s" } ], "settings": { "operation": "full_factorial", "test_script": "use_cases/41_columnar_compression/sim.sh" } }

Experimental Matrix

The Full Factorial Design produces 16 runs. Each row is one experiment with specific factor settings.

Runcodecdictionarypage_size_kbrow_group_mb
1snappyon1024256
2zstdoff64256
3snappyon64256
4snappyon102432
5zstdon102432
6zstdoff102432
7zstdon6432
8zstdoff6432
9snappyoff64256
10snappyoff102432
11zstdon64256
12zstdon1024256
13snappyon6432
14zstdoff1024256
15snappyoff6432
16snappyoff1024256

Step-by-Step Workflow

1

Preview the design

Terminal
$ doe info --config use_cases/41_columnar_compression/config.json
2

Generate the runner script

Terminal
$ doe generate --config use_cases/41_columnar_compression/config.json \ --output use_cases/41_columnar_compression/results/run.sh --seed 42
3

Execute the experiments

Terminal
$ bash use_cases/41_columnar_compression/results/run.sh
4

Analyze results

Terminal
$ doe analyze --config use_cases/41_columnar_compression/config.json
5

Get optimization recommendations

Terminal
$ doe optimize --config use_cases/41_columnar_compression/config.json
6

Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

Terminal
$ doe optimize --config use_cases/41_columnar_compression/config.json --multi
7

Generate the HTML report

Terminal
$ doe report --config use_cases/41_columnar_compression/config.json \ --output use_cases/41_columnar_compression/results/report.html

Features Exercised

FeatureValue
Design typefull_factorial
Factor typescontinuous (2), categorical (2)
Arg styledouble-dash
Responses2 (compression_ratio ↑, read_speed_mbps ↑)
Total runs16

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: compression_ratio

Top factors: dictionary (42.3%), codec (38.6%), row_group_mb (13.8%).

ANOVA

SourceDFSSMSFp-value
SourceDFSSMSFp-value
codec111.645211.64522.1710.2006
dictionary113.968913.96892.6040.1675
page_size_kb10.21390.21390.0400.8496
row_group_mb11.48231.48230.2760.6216
codec*dictionary16.98286.98281.3020.3056
codec*page_size_kb14.56894.56890.8520.3984
codec*row_group_mb11.92521.92520.3590.5752
dictionary*page_size_kb10.38750.38750.0720.7988
dictionary*row_group_mb10.99500.99500.1850.6846
page_size_kb*row_group_mb10.62020.62020.1160.7477
Error526.82165.3643
Total1569.61144.6408

Pareto Chart

Pareto chart for compression_ratio

Main Effects Plot

Main effects plot for compression_ratio

Normal Probability Plot of Effects

Normal probability plot for compression_ratio

Half-Normal Plot of Effects

Half-normal plot for compression_ratio

Model Diagnostics

Model diagnostics for compression_ratio

Response: read_speed_mbps

Top factors: dictionary (58.1%), row_group_mb (28.3%), page_size_kb (7.3%).

ANOVA

SourceDFSSMSFp-value
SourceDFSSMSFp-value
codec11105.56251105.56250.0220.8871
dictionary196876.562596876.56251.9570.2207
page_size_kb11540.56251540.56250.0310.8669
row_group_mb123028.062523028.06250.4650.5256
codec*dictionary139303.062539303.06250.7940.4138
codec*page_size_kb126325.062526325.06250.5320.4986
codec*row_group_mb1464783.0625464783.06259.3870.0280
dictionary*page_size_kb128815.062528815.06250.5820.4800
dictionary*row_group_mb171422.562571422.56251.4430.2835
page_size_kb*row_group_mb141107.562541107.56250.8300.4040
Error5247557.812549511.5625
Total151041864.937569457.6625

Pareto Chart

Pareto chart for read_speed_mbps

Main Effects Plot

Main effects plot for read_speed_mbps

Normal Probability Plot of Effects

Normal probability plot for read_speed_mbps

Half-Normal Plot of Effects

Half-normal plot for read_speed_mbps

Model Diagnostics

Model diagnostics for read_speed_mbps

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

compression ratio page size kb vs row group mb

RSM surface: compression ratio page size kb vs row group mb

read speed mbps page size kb vs row group mb

RSM surface: read speed mbps page size kb vs row group mb

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.7010

Per-Response Desirability

ResponseWeightDesirabilityPredictedDir
compression_ratio 1.0
0.4412
4.73 0.4412 4.73 x
read_speed_mbps 1.5
0.9545
1323.00 0.9545 1323.00 MB/s

Recommended Settings

FactorValue
codeczstd
dictionaryoff
page_size_kb64 KB
row_group_mb256 MB

Source: from observed run #1

Trade-off Summary

Sacrifice = how much worse than single-objective best.

ResponsePredictedBest ObservedSacrifice
read_speed_mbps1323.001323.00+0.00

Top 3 Runs by Desirability

RunDFactor Settings
#120.5858codec=zstd, dictionary=on, page_size_kb=64, row_group_mb=32
#30.5342codec=snappy, dictionary=off, page_size_kb=1024, row_group_mb=256

Model Quality

ResponseType
read_speed_mbps0.1315linear

Full Multi-Objective Output

doe optimize --multi
============================================================ MULTI-OBJECTIVE OPTIMIZATION Method: Derringer-Suich Desirability Function ============================================================ Overall desirability: D = 0.7010 Response Weight Desirability Predicted Direction --------------------------------------------------------------------- compression_ratio 1.0 0.4412 4.73 x ↑ read_speed_mbps 1.5 0.9545 1323.00 MB/s ↑ Recommended settings: codec = zstd dictionary = off page_size_kb = 64 KB row_group_mb = 256 MB (from observed run #1) Trade-off summary: compression_ratio: 4.73 (best observed: 8.70, sacrifice: +3.97) read_speed_mbps: 1323.00 (best observed: 1323.00, sacrifice: +0.00) Model quality: compression_ratio: R² = 0.3839 (linear) read_speed_mbps: R² = 0.1315 (linear) Top 3 observed runs by overall desirability: 1. Run #1 (D=0.7010): codec=zstd, dictionary=off, page_size_kb=64, row_group_mb=256 2. Run #12 (D=0.5858): codec=zstd, dictionary=on, page_size_kb=64, row_group_mb=32 3. Run #3 (D=0.5342): codec=snappy, dictionary=off, page_size_kb=1024, row_group_mb=256

Full Analysis Output

doe analyze
=== Main Effects: compression_ratio === Factor Effect Std Error % Contribution -------------------------------------------------------------- dictionary -1.8687 0.5386 42.3% codec 1.7063 0.5386 38.6% row_group_mb -0.6087 0.5386 13.8% page_size_kb 0.2313 0.5386 5.2% === ANOVA Table: compression_ratio === Source DF SS MS F p-value ----------------------------------------------------------------------------- codec 1 11.6452 11.6452 2.171 0.2006 dictionary 1 13.9689 13.9689 2.604 0.1675 page_size_kb 1 0.2139 0.2139 0.040 0.8496 row_group_mb 1 1.4823 1.4823 0.276 0.6216 codec*dictionary 1 6.9828 6.9828 1.302 0.3056 codec*page_size_kb 1 4.5689 4.5689 0.852 0.3984 codec*row_group_mb 1 1.9252 1.9252 0.359 0.5752 dictionary*page_size_kb 1 0.3875 0.3875 0.072 0.7988 dictionary*row_group_mb 1 0.9950 0.9950 0.185 0.6846 page_size_kb*row_group_mb 1 0.6202 0.6202 0.116 0.7477 Error 5 26.8216 5.3643 Total 15 69.6114 4.6408 === Interaction Effects: compression_ratio === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ codec dictionary -1.3213 30.8% codec page_size_kb 1.0687 24.9% codec row_group_mb 0.6937 16.2% dictionary row_group_mb 0.4988 11.6% page_size_kb row_group_mb 0.3938 9.2% dictionary page_size_kb -0.3112 7.3% === Summary Statistics: compression_ratio === codec: Level N Mean Std Min Max ------------------------------------------------------------ snappy 8 3.6012 1.7388 1.6700 6.7300 zstd 8 5.3075 2.2929 2.5300 8.7000 dictionary: Level N Mean Std Min Max ------------------------------------------------------------ off 8 5.3887 2.5996 1.7700 8.7000 on 8 3.5200 1.0914 1.6700 4.8000 page_size_kb: Level N Mean Std Min Max ------------------------------------------------------------ 1024 8 4.3388 1.9090 2.4400 7.5200 64 8 4.5700 2.5039 1.6700 8.7000 row_group_mb: Level N Mean Std Min Max ------------------------------------------------------------ 256 8 4.7588 2.1186 1.6700 8.7000 32 8 4.1500 2.2900 1.7700 7.6100 === Main Effects: read_speed_mbps === Factor Effect Std Error % Contribution -------------------------------------------------------------- dictionary 155.6250 65.8871 58.1% row_group_mb 75.8750 65.8871 28.3% page_size_kb 19.6250 65.8871 7.3% codec 16.6250 65.8871 6.2% === ANOVA Table: read_speed_mbps === Source DF SS MS F p-value ----------------------------------------------------------------------------- codec 1 1105.5625 1105.5625 0.022 0.8871 dictionary 1 96876.5625 96876.5625 1.957 0.2207 page_size_kb 1 1540.5625 1540.5625 0.031 0.8669 row_group_mb 1 23028.0625 23028.0625 0.465 0.5256 codec*dictionary 1 39303.0625 39303.0625 0.794 0.4138 codec*page_size_kb 1 26325.0625 26325.0625 0.532 0.4986 codec*row_group_mb 1 464783.0625 464783.0625 9.387 0.0280 dictionary*page_size_kb 1 28815.0625 28815.0625 0.582 0.4800 dictionary*row_group_mb 1 71422.5625 71422.5625 1.443 0.2835 page_size_kb*row_group_mb 1 41107.5625 41107.5625 0.830 0.4040 Error 5 247557.8125 49511.5625 Total 15 1041864.9375 69457.6625 === Interaction Effects: read_speed_mbps === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ codec row_group_mb -340.8750 40.5% dictionary row_group_mb 133.6250 15.9% page_size_kb row_group_mb -101.3750 12.1% codec dictionary -99.1250 11.8% dictionary page_size_kb 84.8750 10.1% codec page_size_kb -81.1250 9.6% === Summary Statistics: read_speed_mbps === codec: Level N Mean Std Min Max ------------------------------------------------------------ snappy 8 786.6250 295.3182 478.0000 1323.0000 zstd 8 803.2500 247.9255 483.0000 1137.0000 dictionary: Level N Mean Std Min Max ------------------------------------------------------------ off 8 717.1250 190.1011 531.0000 1077.0000 on 8 872.7500 314.4200 478.0000 1323.0000 page_size_kb: Level N Mean Std Min Max ------------------------------------------------------------ 1024 8 785.1250 254.8660 478.0000 1098.0000 64 8 804.7500 289.2422 483.0000 1323.0000 row_group_mb: Level N Mean Std Min Max ------------------------------------------------------------ 256 8 757.0000 247.3430 478.0000 1137.0000 32 8 832.8750 290.4644 483.0000 1323.0000

Optimization Recommendations

doe optimize
=== Optimization: compression_ratio === Direction: maximize Best observed run: #12 codec = snappy dictionary = on page_size_kb = 1024 row_group_mb = 256 Value: 8.7 RSM Model (linear, R² = 0.2908, Adj R² = 0.0329): Coefficients: intercept +4.4544 codec -0.6481 dictionary +0.7294 page_size_kb -0.1919 row_group_mb +0.5256 RSM Model (quadratic, R² = 0.7457, Adj R² = -2.8142): Coefficients: intercept +0.8909 codec -0.6481 dictionary +0.7294 page_size_kb -0.1919 row_group_mb +0.5256 codec*dictionary +0.5944 codec*page_size_kb -0.8369 codec*row_group_mb -0.3644 dictionary*page_size_kb -0.5244 dictionary*row_group_mb -0.0869 page_size_kb*row_group_mb +0.7144 codec^2 +0.8909 dictionary^2 +0.8909 page_size_kb^2 +0.8909 row_group_mb^2 +0.8909 Curvature analysis: row_group_mb coef=+0.8909 convex (has a minimum) page_size_kb coef=+0.8909 convex (has a minimum) codec coef=+0.8909 convex (has a minimum) dictionary coef=+0.8909 convex (has a minimum) Notable interactions: codec*page_size_kb coef=-0.8369 (antagonistic) page_size_kb*row_group_mb coef=+0.7144 (synergistic) codec*dictionary coef=+0.5944 (synergistic) dictionary*page_size_kb coef=-0.5244 (antagonistic) codec*row_group_mb coef=-0.3644 (antagonistic) Predicted optimum (from linear model, at observed points): codec = snappy dictionary = on page_size_kb = 64 row_group_mb = 256 Predicted value: 6.5494 Surface optimum (via L-BFGS-B, linear model): codec = snappy dictionary = on page_size_kb = 64 row_group_mb = 256 Predicted value: 6.5494 Model quality: Weak fit — consider adding center points or using a different design. Factor importance: 1. dictionary (effect: 1.5, contribution: 34.8%) 2. codec (effect: -1.3, contribution: 30.9%) 3. row_group_mb (effect: -1.1, contribution: 25.1%) 4. page_size_kb (effect: 0.4, contribution: 9.2%) === Optimization: read_speed_mbps === Direction: maximize Best observed run: #1 codec = snappy dictionary = on page_size_kb = 64 row_group_mb = 32 Value: 1323.0 RSM Model (linear, R² = 0.4807, Adj R² = 0.2918): Coefficients: intercept +794.9375 codec -33.4375 dictionary +88.8125 page_size_kb +69.1875 row_group_mb -132.3125 RSM Model (quadratic, R² = 0.6351, Adj R² = -4.4734): Coefficients: intercept +158.9875 codec -33.4375 dictionary +88.8125 page_size_kb +69.1875 row_group_mb -132.3125 codec*dictionary -25.8125 codec*page_size_kb +4.0625 codec*row_group_mb +60.3125 dictionary*page_size_kb +64.5625 dictionary*row_group_mb -33.6875 page_size_kb*row_group_mb -20.8125 codec^2 +158.9875 dictionary^2 +158.9875 page_size_kb^2 +158.9875 row_group_mb^2 +158.9875 Curvature analysis: codec coef=+158.9875 convex (has a minimum) page_size_kb coef=+158.9875 convex (has a minimum) dictionary coef=+158.9875 convex (has a minimum) row_group_mb coef=+158.9875 convex (has a minimum) Notable interactions: dictionary*page_size_kb coef=+64.5625 (synergistic) codec*row_group_mb coef=+60.3125 (synergistic) dictionary*row_group_mb coef=-33.6875 (antagonistic) codec*dictionary coef=-25.8125 (antagonistic) page_size_kb*row_group_mb coef=-20.8125 (antagonistic) codec*page_size_kb coef=+4.0625 (synergistic) Predicted optimum (from linear model, at observed points): codec = snappy dictionary = on page_size_kb = 1024 row_group_mb = 32 Predicted value: 1118.6875 Surface optimum (via L-BFGS-B, linear model): codec = snappy dictionary = on page_size_kb = 1024 row_group_mb = 32 Predicted value: 1118.6875 Model quality: Weak fit — consider adding center points or using a different design. Factor importance: 1. row_group_mb (effect: 264.6, contribution: 40.9%) 2. dictionary (effect: 177.6, contribution: 27.4%) 3. page_size_kb (effect: -138.4, contribution: 21.4%) 4. codec (effect: -66.9, contribution: 10.3%)
← Previous: ETL Batch Size Tuning Next: Query Engine Join Strategy →