Columnar Compression — DOE Use Case

Summary

This experiment investigates columnar compression. Full factorial of compression codec, dictionary encoding, page size, and row group size for ratio vs speed.

The design varies 4 factors: codec, ranging from snappy to zstd, dictionary, ranging from off to on, page size kb (KB), ranging from 64 to 1024, and row group mb (MB), ranging from 32 to 256. The goal is to optimize 2 responses: compression ratio (x) (maximize) and read speed mbps (MB/s) (maximize). Fixed conditions held constant across all runs include format = parquet, data type = mixed_analytics.

A full factorial design was used to explore all 16 possible combinations of the 4 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (16 runs).

Quadratic response surface models were fitted to capture potential curvature and factor interactions. The RSM contour plots below visualize how pairs of factors jointly affect each response.

Key Findings

For compression ratio, the most influential factors were dictionary (42.3%), codec (38.6%), row group mb (13.8%). The best observed value was 8.7 (at codec = snappy, dictionary = on, page size kb = 1024).

For read speed mbps, the most influential factors were dictionary (58.1%), row group mb (28.3%), page size kb (7.3%). The best observed value was 1323.0 (at codec = snappy, dictionary = on, page size kb = 64).

Recommended Next Steps

Consider whether any fixed factors should be varied in a future study.

Experimental Setup

Factors

Factor	Low	High	Unit
`codec`	snappy	zstd
`dictionary`	off	on
`page_size_kb`	64	1024	KB
`row_group_mb`	32	256	MB

Fixed: format = parquet, data_type = mixed_analytics

Responses

Response	Direction	Unit
`compression_ratio`	↑ maximize	x
`read_speed_mbps`	↑ maximize	MB/s

Configuration

use_cases/41_columnar_compression/config.json

{
  "metadata": {
    "name": "Columnar Compression",
    "description": "Full factorial of compression codec, dictionary encoding, page size, and row group size for ratio vs speed"
  },
  "factors": [
    {
      "name": "codec",
      "levels": [
        "snappy",
        "zstd"
      ],
      "type": "categorical",
      "unit": ""
    },
    {
      "name": "dictionary",
      "levels": [
        "off",
        "on"
      ],
      "type": "categorical",
      "unit": ""
    },
    {
      "name": "page_size_kb",
      "levels": [
        "64",
        "1024"
      ],
      "type": "continuous",
      "unit": "KB"
    },
    {
      "name": "row_group_mb",
      "levels": [
        "32",
        "256"
      ],
      "type": "continuous",
      "unit": "MB"
    }
  ],
  "fixed_factors": {
    "format": "parquet",
    "data_type": "mixed_analytics"
  },
  "responses": [
    {
      "name": "compression_ratio",
      "optimize": "maximize",
      "unit": "x"
    },
    {
      "name": "read_speed_mbps",
      "optimize": "maximize",
      "unit": "MB/s"
    }
  ],
  "settings": {
    "operation": "full_factorial",
    "test_script": "use_cases/41_columnar_compression/sim.sh"
  }
}

Experimental Matrix

The Full Factorial Design produces 16 runs. Each row is one experiment with specific factor settings.

Run	`codec`	`dictionary`	`page_size_kb`	`row_group_mb`
1	snappy	on	1024	256
2	zstd	off	64	256
3	snappy	on	64	256
4	snappy	on	1024	32
5	zstd	on	1024	32
6	zstd	off	1024	32
7	zstd	on	64	32
8	zstd	off	64	32
9	snappy	off	64	256
10	snappy	off	1024	32
11	zstd	on	64	256
12	zstd	on	1024	256
13	snappy	on	64	32
14	zstd	off	1024	256
15	snappy	off	64	32
16	snappy	off	1024	256

Step-by-Step Workflow

1

Preview the design

Terminal

$ doe info --config use_cases/41_columnar_compression/config.json

2

Generate the runner script

Terminal

$ doe generate --config use_cases/41_columnar_compression/config.json \
    --output use_cases/41_columnar_compression/results/run.sh --seed 42

3

Execute the experiments

Terminal

$ bash use_cases/41_columnar_compression/results/run.sh

4

Analyze results

Terminal

$ doe analyze --config use_cases/41_columnar_compression/config.json

5

Get optimization recommendations

Terminal

$ doe optimize --config use_cases/41_columnar_compression/config.json

6

Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

Terminal

$ doe optimize --config use_cases/41_columnar_compression/config.json --multi

7

Generate the HTML report

Terminal

$ doe report --config use_cases/41_columnar_compression/config.json \
    --output use_cases/41_columnar_compression/results/report.html

Features Exercised

Feature	Value
Design type	`full_factorial`
Factor types	`continuous` (2), `categorical` (2)
Arg style	`double-dash`
Responses	2 (compression_ratio ↑, read_speed_mbps ↑)
Total runs	16

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: compression_ratio

Top factors: dictionary (42.3%), codec (38.6%), row_group_mb (13.8%).

ANOVA

Source	DF	SS	MS	F	p-value
Source	DF	SS	MS	F	p-value
codec	1	11.6452	11.6452	2.171	0.2006
dictionary	1	13.9689	13.9689	2.604	0.1675
page_size_kb	1	0.2139	0.2139	0.040	0.8496
row_group_mb	1	1.4823	1.4823	0.276	0.6216
codec*dictionary	1	6.9828	6.9828	1.302	0.3056
codec*page_size_kb	1	4.5689	4.5689	0.852	0.3984
codec*row_group_mb	1	1.9252	1.9252	0.359	0.5752
dictionary*page_size_kb	1	0.3875	0.3875	0.072	0.7988
dictionary*row_group_mb	1	0.9950	0.9950	0.185	0.6846
page_size_kb*row_group_mb	1	0.6202	0.6202	0.116	0.7477
Error	5	26.8216	5.3643
Total	15	69.6114	4.6408

Pareto Chart

Main Effects Plot

Normal Probability Plot of Effects

Normal probability plot for compression_ratio

Half-Normal Plot of Effects

Model Diagnostics

Response: read_speed_mbps

Top factors: dictionary (58.1%), row_group_mb (28.3%), page_size_kb (7.3%).

ANOVA

Source	DF	SS	MS	F	p-value
Source	DF	SS	MS	F	p-value
codec	1	1105.5625	1105.5625	0.022	0.8871
dictionary	1	96876.5625	96876.5625	1.957	0.2207
page_size_kb	1	1540.5625	1540.5625	0.031	0.8669
row_group_mb	1	23028.0625	23028.0625	0.465	0.5256
codec*dictionary	1	39303.0625	39303.0625	0.794	0.4138
codec*page_size_kb	1	26325.0625	26325.0625	0.532	0.4986
codec*row_group_mb	1	464783.0625	464783.0625	9.387	0.0280
dictionary*page_size_kb	1	28815.0625	28815.0625	0.582	0.4800
dictionary*row_group_mb	1	71422.5625	71422.5625	1.443	0.2835
page_size_kb*row_group_mb	1	41107.5625	41107.5625	0.830	0.4040
Error	5	247557.8125	49511.5625
Total	15	1041864.9375	69457.6625

Pareto Chart

Main Effects Plot

Normal Probability Plot of Effects

Normal probability plot for read_speed_mbps

Half-Normal Plot of Effects

Model Diagnostics

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

compression ratio page size kb vs row group mb

read speed mbps page size kb vs row group mb

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability

D = 0.7010

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`compression_ratio`	1.0	0.4412	4.73 0.4412 4.73 x	↑
`read_speed_mbps`	1.5	0.9545	1323.00 0.9545 1323.00 MB/s	↑

Recommended Settings

Factor	Value
`codec`	zstd
`dictionary`	off
`page_size_kb`	64 KB
`row_group_mb`	256 MB

Source: from observed run #1

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response	Predicted	Best Observed	Sacrifice
`read_speed_mbps`	1323.00	1323.00	+0.00

Top 3 Runs by Desirability

Run	D	Factor Settings
#12	0.5858	codec=zstd, dictionary=on, page_size_kb=64, row_group_mb=32
#3	0.5342	codec=snappy, dictionary=off, page_size_kb=1024, row_group_mb=256

Model Quality

Response	R²	Type
`read_speed_mbps`	0.1315	linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.7010

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
compression_ratio            1.0       0.4412        4.73 x   ↑
read_speed_mbps              1.5       0.9545     1323.00 MB/s   ↑

Recommended settings:
  codec = zstd
  dictionary = off
  page_size_kb = 64 KB
  row_group_mb = 256 MB
  (from observed run #1)

Trade-off summary:
  compression_ratio: 4.73 (best observed: 8.70, sacrifice: +3.97)
  read_speed_mbps: 1323.00 (best observed: 1323.00, sacrifice: +0.00)

Model quality:
  compression_ratio: R² = 0.3839 (linear)
  read_speed_mbps: R² = 0.1315 (linear)

Top 3 observed runs by overall desirability:
  1. Run #1 (D=0.7010): codec=zstd, dictionary=off, page_size_kb=64, row_group_mb=256
  2. Run #12 (D=0.5858): codec=zstd, dictionary=on, page_size_kb=64, row_group_mb=32
  3. Run #3 (D=0.5342): codec=snappy, dictionary=off, page_size_kb=1024, row_group_mb=256

Full Analysis Output

doe analyze
=== Main Effects: compression_ratio ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
dictionary              -1.8687       0.5386            42.3%
codec                    1.7063       0.5386            38.6%
row_group_mb            -0.6087       0.5386            13.8%
page_size_kb             0.2313       0.5386             5.2%

=== ANOVA Table: compression_ratio ===
Source                      DF           SS           MS          F    p-value
-----------------------------------------------------------------------------
codec                        1      11.6452      11.6452      2.171     0.2006
dictionary                   1      13.9689      13.9689      2.604     0.1675
page_size_kb                 1       0.2139       0.2139      0.040     0.8496
row_group_mb                 1       1.4823       1.4823      0.276     0.6216
codec*dictionary             1       6.9828       6.9828      1.302     0.3056
codec*page_size_kb           1       4.5689       4.5689      0.852     0.3984
codec*row_group_mb           1       1.9252       1.9252      0.359     0.5752
dictionary*page_size_kb      1       0.3875       0.3875      0.072     0.7988
dictionary*row_group_mb      1       0.9950       0.9950      0.185     0.6846
page_size_kb*row_group_mb    1       0.6202       0.6202      0.116     0.7477
Error                        5      26.8216       5.3643
Total                       15      69.6114       4.6408

=== Interaction Effects: compression_ratio ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
codec                dictionary                -1.3213            30.8%
codec                page_size_kb               1.0687            24.9%
codec                row_group_mb               0.6937            16.2%
dictionary           row_group_mb               0.4988            11.6%
page_size_kb         row_group_mb               0.3938             9.2%
dictionary           page_size_kb              -0.3112             7.3%

=== Summary Statistics: compression_ratio ===

codec:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  snappy              8     3.6012     1.7388     1.6700     6.7300
  zstd                8     5.3075     2.2929     2.5300     8.7000

dictionary:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8     5.3887     2.5996     1.7700     8.7000
  on                  8     3.5200     1.0914     1.6700     4.8000

page_size_kb:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1024                8     4.3388     1.9090     2.4400     7.5200
  64                  8     4.5700     2.5039     1.6700     8.7000

row_group_mb:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  256                 8     4.7588     2.1186     1.6700     8.7000
  32                  8     4.1500     2.2900     1.7700     7.6100

=== Main Effects: read_speed_mbps ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
dictionary             155.6250      65.8871            58.1%
row_group_mb            75.8750      65.8871            28.3%
page_size_kb            19.6250      65.8871             7.3%
codec                   16.6250      65.8871             6.2%

=== ANOVA Table: read_speed_mbps ===
Source                      DF           SS           MS          F    p-value
-----------------------------------------------------------------------------
codec                        1    1105.5625    1105.5625      0.022     0.8871
dictionary                   1   96876.5625   96876.5625      1.957     0.2207
page_size_kb                 1    1540.5625    1540.5625      0.031     0.8669
row_group_mb                 1   23028.0625   23028.0625      0.465     0.5256
codec*dictionary             1   39303.0625   39303.0625      0.794     0.4138
codec*page_size_kb           1   26325.0625   26325.0625      0.532     0.4986
codec*row_group_mb           1  464783.0625  464783.0625      9.387     0.0280
dictionary*page_size_kb      1   28815.0625   28815.0625      0.582     0.4800
dictionary*row_group_mb      1   71422.5625   71422.5625      1.443     0.2835
page_size_kb*row_group_mb    1   41107.5625   41107.5625      0.830     0.4040
Error                        5  247557.8125   49511.5625
Total                       15 1041864.9375   69457.6625

=== Interaction Effects: read_speed_mbps ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
codec                row_group_mb            -340.8750            40.5%
dictionary           row_group_mb             133.6250            15.9%
page_size_kb         row_group_mb            -101.3750            12.1%
codec                dictionary               -99.1250            11.8%
dictionary           page_size_kb              84.8750            10.1%
codec                page_size_kb             -81.1250             9.6%

=== Summary Statistics: read_speed_mbps ===

codec:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  snappy              8   786.6250   295.3182   478.0000  1323.0000
  zstd                8   803.2500   247.9255   483.0000  1137.0000

dictionary:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  off                 8   717.1250   190.1011   531.0000  1077.0000
  on                  8   872.7500   314.4200   478.0000  1323.0000

page_size_kb:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  1024                8   785.1250   254.8660   478.0000  1098.0000
  64                  8   804.7500   289.2422   483.0000  1323.0000

row_group_mb:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  256                 8   757.0000   247.3430   478.0000  1137.0000
  32                  8   832.8750   290.4644   483.0000  1323.0000

Optimization Recommendations

doe optimize
=== Optimization: compression_ratio ===
Direction: maximize

Best observed run: #12
  codec = snappy
  dictionary = on
  page_size_kb = 1024
  row_group_mb = 256
  Value: 8.7

RSM Model (linear, R² = 0.2908, Adj R² = 0.0329):
  Coefficients:
    intercept                      +4.4544
    codec                          -0.6481
    dictionary                     +0.7294
    page_size_kb                   -0.1919
    row_group_mb                   +0.5256

RSM Model (quadratic, R² = 0.7457, Adj R² = -2.8142):
  Coefficients:
    intercept                      +0.8909
    codec                          -0.6481
    dictionary                     +0.7294
    page_size_kb                   -0.1919
    row_group_mb                   +0.5256
    codec*dictionary               +0.5944
    codec*page_size_kb             -0.8369
    codec*row_group_mb             -0.3644
    dictionary*page_size_kb        -0.5244
    dictionary*row_group_mb        -0.0869
    page_size_kb*row_group_mb      +0.7144
    codec^2                        +0.8909
    dictionary^2                   +0.8909
    page_size_kb^2                 +0.8909
    row_group_mb^2                 +0.8909

  Curvature analysis:
    row_group_mb                   coef=+0.8909  convex (has a minimum)
    page_size_kb                   coef=+0.8909  convex (has a minimum)
    codec                          coef=+0.8909  convex (has a minimum)
    dictionary                     coef=+0.8909  convex (has a minimum)

  Notable interactions:
    codec*page_size_kb             coef=-0.8369  (antagonistic)
    page_size_kb*row_group_mb      coef=+0.7144  (synergistic)
    codec*dictionary               coef=+0.5944  (synergistic)
    dictionary*page_size_kb        coef=-0.5244  (antagonistic)
    codec*row_group_mb             coef=-0.3644  (antagonistic)

  Predicted optimum (from linear model, at observed points):
    codec = snappy
    dictionary = on
    page_size_kb = 64
    row_group_mb = 256
    Predicted value: 6.5494

  Surface optimum (via L-BFGS-B, linear model):
    codec = snappy
    dictionary = on
    page_size_kb = 64
    row_group_mb = 256
    Predicted value: 6.5494

  Model quality: Weak fit — consider adding center points or using a different design.

Factor importance:
  1. dictionary  (effect: 1.5, contribution: 34.8%)
  2. codec  (effect: -1.3, contribution: 30.9%)
  3. row_group_mb  (effect: -1.1, contribution: 25.1%)
  4. page_size_kb  (effect: 0.4, contribution: 9.2%)

=== Optimization: read_speed_mbps ===
Direction: maximize

Best observed run: #1
  codec = snappy
  dictionary = on
  page_size_kb = 64
  row_group_mb = 32
  Value: 1323.0

RSM Model (linear, R² = 0.4807, Adj R² = 0.2918):
  Coefficients:
    intercept                      +794.9375
    codec                          -33.4375
    dictionary                     +88.8125
    page_size_kb                   +69.1875
    row_group_mb                   -132.3125

RSM Model (quadratic, R² = 0.6351, Adj R² = -4.4734):
  Coefficients:
    intercept                      +158.9875
    codec                          -33.4375
    dictionary                     +88.8125
    page_size_kb                   +69.1875
    row_group_mb                   -132.3125
    codec*dictionary               -25.8125
    codec*page_size_kb             +4.0625
    codec*row_group_mb             +60.3125
    dictionary*page_size_kb        +64.5625
    dictionary*row_group_mb        -33.6875
    page_size_kb*row_group_mb      -20.8125
    codec^2                        +158.9875
    dictionary^2                   +158.9875
    page_size_kb^2                 +158.9875
    row_group_mb^2                 +158.9875

  Curvature analysis:
    codec                          coef=+158.9875  convex (has a minimum)
    page_size_kb                   coef=+158.9875  convex (has a minimum)
    dictionary                     coef=+158.9875  convex (has a minimum)
    row_group_mb                   coef=+158.9875  convex (has a minimum)

  Notable interactions:
    dictionary*page_size_kb        coef=+64.5625  (synergistic)
    codec*row_group_mb             coef=+60.3125  (synergistic)
    dictionary*row_group_mb        coef=-33.6875  (antagonistic)
    codec*dictionary               coef=-25.8125  (antagonistic)
    page_size_kb*row_group_mb      coef=-20.8125  (antagonistic)
    codec*page_size_kb             coef=+4.0625  (synergistic)

  Predicted optimum (from linear model, at observed points):
    codec = snappy
    dictionary = on
    page_size_kb = 1024
    row_group_mb = 32
    Predicted value: 1118.6875

  Surface optimum (via L-BFGS-B, linear model):
    codec = snappy
    dictionary = on
    page_size_kb = 1024
    row_group_mb = 32
    Predicted value: 1118.6875

  Model quality: Weak fit — consider adding center points or using a different design.

Factor importance:
  1. row_group_mb  (effect: 264.6, contribution: 40.9%)
  2. dictionary  (effect: 177.6, contribution: 27.4%)
  3. page_size_kb  (effect: -138.4, contribution: 21.4%)
  4. codec  (effect: -66.9, contribution: 10.3%)