← All Use Cases
📊
Fractional Factorial Design

ETL Batch Size Tuning

Fractional factorial of 5 ETL pipeline parameters for throughput and resource efficiency

Summary

This experiment investigates etl batch size tuning. Fractional factorial of 5 ETL pipeline parameters for throughput and resource efficiency.

The design varies 5 factors: batch size (rows), ranging from 1000 to 100000, writer threads (threads), ranging from 1 to 16, commit interval (rows), ranging from 1000 to 50000, transform mode, ranging from row to vectorized, and buffer mb (MB), ranging from 64 to 512. The goal is to optimize 2 responses: rows per sec (rows/s) (maximize) and peak memory gb (GB) (minimize). Fixed conditions held constant across all runs include source = postgresql, sink = s3_parquet.

A fractional factorial design reduces the number of runs from 32 to 8 by deliberately confounding higher-order interactions. This is ideal for screening — identifying which of the 5 factors matter most before investing in a full study.

Key Findings

For rows per sec, the most influential factors were commit interval (37.6%), transform mode (35.0%), writer threads (14.9%). The best observed value was 106862.0 (at batch size = 1000, writer threads = 1, commit interval = 50000).

For peak memory gb, the most influential factors were writer threads (41.2%), buffer mb (28.3%), commit interval (20.9%). The best observed value was 0.5 (at batch size = 100000, writer threads = 16, commit interval = 1000).

Recommended Next Steps

Experimental Setup

Factors

FactorLowHighUnit
batch_size1000100000rows
writer_threads116threads
commit_interval100050000rows
transform_moderowvectorized
buffer_mb64512MB

Fixed: source = postgresql, sink = s3_parquet

Responses

ResponseDirectionUnit
rows_per_sec↑ maximizerows/s
peak_memory_gb↓ minimizeGB

Configuration

use_cases/40_etl_batch_size_tuning/config.json
{ "metadata": { "name": "ETL Batch Size Tuning", "description": "Fractional factorial of 5 ETL pipeline parameters for throughput and resource efficiency" }, "factors": [ { "name": "batch_size", "levels": [ "1000", "100000" ], "type": "continuous", "unit": "rows" }, { "name": "writer_threads", "levels": [ "1", "16" ], "type": "continuous", "unit": "threads" }, { "name": "commit_interval", "levels": [ "1000", "50000" ], "type": "continuous", "unit": "rows" }, { "name": "transform_mode", "levels": [ "row", "vectorized" ], "type": "categorical", "unit": "" }, { "name": "buffer_mb", "levels": [ "64", "512" ], "type": "continuous", "unit": "MB" } ], "fixed_factors": { "source": "postgresql", "sink": "s3_parquet" }, "responses": [ { "name": "rows_per_sec", "optimize": "maximize", "unit": "rows/s" }, { "name": "peak_memory_gb", "optimize": "minimize", "unit": "GB" } ], "settings": { "operation": "fractional_factorial", "test_script": "use_cases/40_etl_batch_size_tuning/sim.sh" } }

Experimental Matrix

The Fractional Factorial Design produces 8 runs. Each row is one experiment with specific factor settings.

Runbatch_sizewriter_threadscommit_intervaltransform_modebuffer_mb
110001650000row64
210000011000row64
3100000161000vectorized64
41000001650000vectorized512
51000161000row512
6100000150000row512
7100011000vectorized512
81000150000vectorized64

Step-by-Step Workflow

1

Preview the design

Terminal
$ doe info --config use_cases/40_etl_batch_size_tuning/config.json
2

Generate the runner script

Terminal
$ doe generate --config use_cases/40_etl_batch_size_tuning/config.json \ --output use_cases/40_etl_batch_size_tuning/results/run.sh --seed 42
3

Execute the experiments

Terminal
$ bash use_cases/40_etl_batch_size_tuning/results/run.sh
4

Analyze results

Terminal
$ doe analyze --config use_cases/40_etl_batch_size_tuning/config.json
5

Get optimization recommendations

Terminal
$ doe optimize --config use_cases/40_etl_batch_size_tuning/config.json
6

Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

Terminal
$ doe optimize --config use_cases/40_etl_batch_size_tuning/config.json --multi
7

Generate the HTML report

Terminal
$ doe report --config use_cases/40_etl_batch_size_tuning/config.json \ --output use_cases/40_etl_batch_size_tuning/results/report.html

Features Exercised

FeatureValue
Design typefractional_factorial
Factor typescontinuous (4), categorical (1)
Arg styledouble-dash
Responses2 (rows_per_sec ↑, peak_memory_gb ↓)
Total runs8

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: rows_per_sec

Top factors: commit_interval (37.6%), transform_mode (35.0%), writer_threads (14.9%).

ANOVA

SourceDFSSMSFp-value
SourceDFSSMSFp-value
batch_size14283664.50004283664.50000.0300.8689
writer_threads1300835920.5000300835920.50002.1200.2051
commit_interval11908816684.50001908816684.500013.4540.0145
transform_mode11656001250.00001656001250.000011.6720.0189
buffer_mb1154528200.0000154528200.00001.0890.3445
batch_size*writer_threads11656001250.00001656001250.000011.6720.0189
batch_size*commit_interval1154528200.0000154528200.00001.0890.3445
batch_size*transform_mode1300835920.5000300835920.50002.1200.2051
batch_size*buffer_mb11908816684.50001908816684.500013.4540.0145
writer_threads*commit_interval194586258.000094586258.00000.6670.4513
writer_threads*transform_mode14283664.50004283664.50000.0300.8689
writer_threads*buffer_mb13685025400.50003685025400.500025.9730.0038
commit_interval*transform_mode13685025400.50003685025400.500025.9730.0038
commit_interval*buffer_mb14283664.50004283664.50000.0300.8689
transform_mode*buffer_mb194586258.000094586258.00000.6670.4513
Error(LenthPSE)5709396935.0000141879387.0000
Total77804077378.00001114868196.8571

Pareto Chart

Pareto chart for rows_per_sec

Main Effects Plot

Main effects plot for rows_per_sec

Normal Probability Plot of Effects

Normal probability plot for rows_per_sec

Half-Normal Plot of Effects

Half-normal plot for rows_per_sec

Model Diagnostics

Model diagnostics for rows_per_sec

Response: peak_memory_gb

Top factors: writer_threads (41.2%), buffer_mb (28.3%), commit_interval (20.9%).

ANOVA

SourceDFSSMSFp-value
SourceDFSSMSFp-value
batch_size10.01130.01130.0050.9469
writer_threads17.41137.41133.2270.1324
commit_interval11.90121.90120.8280.4046
transform_mode10.28120.28120.1220.7407
buffer_mb13.51133.51131.5290.2712
batch_size*writer_threads10.28120.28120.1220.7407
batch_size*commit_interval13.51133.51131.5290.2712
batch_size*transform_mode17.41127.41123.2270.1324
batch_size*buffer_mb11.90121.90120.8280.4046
writer_threads*commit_interval11.53131.53130.6670.4513
writer_threads*transform_mode10.01120.01120.0050.9469
writer_threads*buffer_mb122.111222.11129.6270.0268
commit_interval*transform_mode122.111322.11139.6270.0268
commit_interval*buffer_mb10.01130.01130.0050.9469
transform_mode*buffer_mb11.53131.53130.6670.4513
Error(LenthPSE)511.48442.2969
Total736.75875.2512

Pareto Chart

Pareto chart for peak_memory_gb

Main Effects Plot

Main effects plot for peak_memory_gb

Normal Probability Plot of Effects

Normal probability plot for peak_memory_gb

Half-Normal Plot of Effects

Half-normal plot for peak_memory_gb

Model Diagnostics

Model diagnostics for peak_memory_gb

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

peak memory gb batch size vs buffer mb

RSM surface: peak memory gb batch size vs buffer mb

peak memory gb batch size vs commit interval

RSM surface: peak memory gb batch size vs commit interval

peak memory gb batch size vs writer threads

RSM surface: peak memory gb batch size vs writer threads

peak memory gb commit interval vs buffer mb

RSM surface: peak memory gb commit interval vs buffer mb

peak memory gb writer threads vs buffer mb

RSM surface: peak memory gb writer threads vs buffer mb

peak memory gb writer threads vs commit interval

RSM surface: peak memory gb writer threads vs commit interval

rows per sec batch size vs buffer mb

RSM surface: rows per sec batch size vs buffer mb

rows per sec batch size vs commit interval

RSM surface: rows per sec batch size vs commit interval

rows per sec batch size vs writer threads

RSM surface: rows per sec batch size vs writer threads

rows per sec commit interval vs buffer mb

RSM surface: rows per sec commit interval vs buffer mb

rows per sec writer threads vs buffer mb

RSM surface: rows per sec writer threads vs buffer mb

rows per sec writer threads vs commit interval

RSM surface: rows per sec writer threads vs commit interval

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.6664

Per-Response Desirability

ResponseWeightDesirabilityPredictedDir
rows_per_sec 1.5
0.6897
80394.00 0.6897 80394.00 rows/s
peak_memory_gb 1.0
0.6329
2.80 0.6329 2.80 GB

Recommended Settings

FactorValue
batch_size1000 rows
writer_threads1 threads
commit_interval1000 rows
transform_modevectorized
buffer_mb512 MB

Source: from observed run #3

Trade-off Summary

Sacrifice = how much worse than single-objective best.

ResponsePredictedBest ObservedSacrifice
peak_memory_gb2.800.50+2.30

Top 3 Runs by Desirability

RunDFactor Settings
#60.3273batch_size=100000, writer_threads=1, commit_interval=1000, transform_mode=row, buffer_mb=64
#70.3023batch_size=1000, writer_threads=16, commit_interval=1000, transform_mode=row, buffer_mb=512

Model Quality

ResponseType
peak_memory_gb0.9956linear

Full Multi-Objective Output

doe optimize --multi
============================================================ MULTI-OBJECTIVE OPTIMIZATION Method: Derringer-Suich Desirability Function ============================================================ Overall desirability: D = 0.6664 Response Weight Desirability Predicted Direction --------------------------------------------------------------------- rows_per_sec 1.5 0.6897 80394.00 rows/s ↑ peak_memory_gb 1.0 0.6329 2.80 GB ↓ Recommended settings: batch_size = 1000 rows writer_threads = 1 threads commit_interval = 1000 rows transform_mode = vectorized buffer_mb = 512 MB (from observed run #3) Trade-off summary: rows_per_sec: 80394.00 (best observed: 106862.00, sacrifice: +26468.00) peak_memory_gb: 2.80 (best observed: 0.50, sacrifice: +2.30) Model quality: rows_per_sec: R² = 0.8304 (linear) peak_memory_gb: R² = 0.9956 (linear) Top 3 observed runs by overall desirability: 1. Run #3 (D=0.6664): batch_size=1000, writer_threads=1, commit_interval=1000, transform_mode=vectorized, buffer_mb=512 2. Run #6 (D=0.3273): batch_size=100000, writer_threads=1, commit_interval=1000, transform_mode=row, buffer_mb=64 3. Run #7 (D=0.3023): batch_size=1000, writer_threads=16, commit_interval=1000, transform_mode=row, buffer_mb=512

Full Analysis Output

doe analyze
=== Main Effects: rows_per_sec === Factor Effect Std Error % Contribution -------------------------------------------------------------- commit_interval 30893.5000 11805.0212 37.6% transform_mode 28775.0000 11805.0212 35.0% writer_threads 12264.5000 11805.0212 14.9% buffer_mb -8790.0000 11805.0212 10.7% batch_size -1463.5000 11805.0212 1.8% === ANOVA Table: rows_per_sec === Source DF SS MS F p-value ----------------------------------------------------------------------------- batch_size 1 4283664.5000 4283664.5000 0.030 0.8689 writer_threads 1 300835920.5000 300835920.5000 2.120 0.2051 commit_interval 1 1908816684.5000 1908816684.5000 13.454 0.0145 transform_mode 1 1656001250.0000 1656001250.0000 11.672 0.0189 buffer_mb 1 154528200.0000 154528200.0000 1.089 0.3445 batch_size*writer_threads 1 1656001250.0000 1656001250.0000 11.672 0.0189 batch_size*commit_interval 1 154528200.0000 154528200.0000 1.089 0.3445 batch_size*transform_mode 1 300835920.5000 300835920.5000 2.120 0.2051 batch_size*buffer_mb 1 1908816684.5000 1908816684.5000 13.454 0.0145 writer_threads*commit_interval 1 94586258.0000 94586258.0000 0.667 0.4513 writer_threads*transform_mode 1 4283664.5000 4283664.5000 0.030 0.8689 writer_threads*buffer_mb 1 3685025400.5000 3685025400.5000 25.973 0.0038 commit_interval*transform_mode 1 3685025400.5000 3685025400.5000 25.973 0.0038 commit_interval*buffer_mb 1 4283664.5000 4283664.5000 0.030 0.8689 transform_mode*buffer_mb 1 94586258.0000 94586258.0000 0.667 0.4513 Error (Lenth PSE) 5 709396935.0000 141879387.0000 Total 7 7804077378.0000 1114868196.8571 Note: Error estimated using Lenth's pseudo-standard-error (unreplicated design) === Interaction Effects: rows_per_sec === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ writer_threads buffer_mb -42924.5000 23.4% commit_interval transform_mode 42924.5000 23.4% batch_size buffer_mb -30893.5000 16.9% batch_size writer_threads 28775.0000 15.7% batch_size transform_mode 12264.5000 6.7% batch_size commit_interval 8790.0000 4.8% writer_threads commit_interval 6877.0000 3.8% transform_mode buffer_mb -6877.0000 3.8% writer_threads transform_mode -1463.5000 0.8% commit_interval buffer_mb 1463.5000 0.8% === Summary Statistics: rows_per_sec === batch_size: Level N Mean Std Min Max ------------------------------------------------------------ 1000 4 43063.2500 26208.0368 22243.0000 80394.0000 100000 4 41599.7500 43738.6562 16021.0000 106862.0000 writer_threads: Level N Mean Std Min Max ------------------------------------------------------------ 1 4 36199.2500 29756.3620 16021.0000 80394.0000 16 4 48463.7500 40195.0172 17377.0000 106862.0000 commit_interval: Level N Mean Std Min Max ------------------------------------------------------------ 1000 4 26884.7500 10557.2687 17377.0000 41780.0000 50000 4 57778.2500 43053.8149 16021.0000 106862.0000 transform_mode: Level N Mean Std Min Max ------------------------------------------------------------ row 4 27944.0000 10596.5799 16021.0000 41780.0000 vectorized 4 56719.0000 44012.1711 17377.0000 106862.0000 buffer_mb: Level N Mean Std Min Max ------------------------------------------------------------ 512 4 46726.5000 41565.2729 16021.0000 106862.0000 64 4 37936.5000 28673.6433 17377.0000 80394.0000 === Main Effects: peak_memory_gb === Factor Effect Std Error % Contribution -------------------------------------------------------------- writer_threads 1.9250 0.8102 41.2% buffer_mb -1.3250 0.8102 28.3% commit_interval 0.9750 0.8102 20.9% transform_mode 0.3750 0.8102 8.0% batch_size 0.0750 0.8102 1.6% === ANOVA Table: peak_memory_gb === Source DF SS MS F p-value ----------------------------------------------------------------------------- batch_size 1 0.0113 0.0113 0.005 0.9469 writer_threads 1 7.4113 7.4113 3.227 0.1324 commit_interval 1 1.9012 1.9012 0.828 0.4046 transform_mode 1 0.2812 0.2812 0.122 0.7407 buffer_mb 1 3.5113 3.5113 1.529 0.2712 batch_size*writer_threads 1 0.2812 0.2812 0.122 0.7407 batch_size*commit_interval 1 3.5113 3.5113 1.529 0.2712 batch_size*transform_mode 1 7.4112 7.4112 3.227 0.1324 batch_size*buffer_mb 1 1.9012 1.9012 0.828 0.4046 writer_threads*commit_interval 1 1.5313 1.5313 0.667 0.4513 writer_threads*transform_mode 1 0.0112 0.0112 0.005 0.9469 writer_threads*buffer_mb 1 22.1112 22.1112 9.627 0.0268 commit_interval*transform_mode 1 22.1113 22.1113 9.627 0.0268 commit_interval*buffer_mb 1 0.0113 0.0113 0.005 0.9469 transform_mode*buffer_mb 1 1.5313 1.5313 0.667 0.4513 Error (Lenth PSE) 5 11.4844 2.2969 Total 7 36.7587 5.2512 Note: Error estimated using Lenth's pseudo-standard-error (unreplicated design) === Interaction Effects: peak_memory_gb === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ writer_threads buffer_mb -3.3250 25.3% commit_interval transform_mode 3.3250 25.3% batch_size transform_mode 1.9250 14.6% batch_size commit_interval 1.3250 10.1% batch_size buffer_mb -0.9750 7.4% writer_threads commit_interval 0.8750 6.7% transform_mode buffer_mb -0.8750 6.7% batch_size writer_threads 0.3750 2.9% writer_threads transform_mode 0.0750 0.6% commit_interval buffer_mb -0.0750 0.6% === Summary Statistics: peak_memory_gb === batch_size: Level N Mean Std Min Max ------------------------------------------------------------ 1000 4 2.5250 1.6860 0.7000 4.7000 100000 4 2.6000 3.0670 0.5000 7.0000 writer_threads: Level N Mean Std Min Max ------------------------------------------------------------ 1 4 1.6000 1.1690 0.5000 2.8000 16 4 3.5250 2.9010 0.5000 7.0000 commit_interval: Level N Mean Std Min Max ------------------------------------------------------------ 1000 4 2.0750 1.9466 0.5000 4.7000 50000 4 3.0500 2.7982 0.5000 7.0000 transform_mode: Level N Mean Std Min Max ------------------------------------------------------------ row 4 2.3750 1.7462 0.5000 4.7000 vectorized 4 2.7500 3.0183 0.5000 7.0000 buffer_mb: Level N Mean Std Min Max ------------------------------------------------------------ 512 4 3.2250 3.1742 0.5000 7.0000 64 4 1.9000 1.0033 0.5000 2.8000

Optimization Recommendations

doe optimize
=== Optimization: rows_per_sec === Direction: maximize Best observed run: #4 batch_size = 1000 writer_threads = 1 commit_interval = 50000 transform_mode = vectorized buffer_mb = 64 Value: 106862.0 RSM Model (linear, R² = 0.7485, Adj R² = 0.1199): Coefficients: intercept +42331.5000 batch_size -1248.0000 writer_threads -17976.2500 commit_interval +19271.7500 transform_mode +5793.2500 buffer_mb +731.7500 Predicted optimum (from linear model, at observed points): batch_size = 1000 writer_threads = 1 commit_interval = 50000 transform_mode = vectorized buffer_mb = 64 Predicted value: 85889.0000 Surface optimum (via L-BFGS-B, linear model): batch_size = 1000 writer_threads = 1 commit_interval = 50000 transform_mode = vectorized buffer_mb = 512 Predicted value: 87352.5000 Model quality: Good fit — general trends are captured, some noise remains. Factor importance: 1. commit_interval (effect: 38543.5, contribution: 42.8%) 2. writer_threads (effect: -35952.5, contribution: 39.9%) 3. transform_mode (effect: 11586.5, contribution: 12.9%) 4. batch_size (effect: -2496.0, contribution: 2.8%) 5. buffer_mb (effect: -1463.5, contribution: 1.6%) === Optimization: peak_memory_gb === Direction: minimize Best observed run: #1 batch_size = 100000 writer_threads = 16 commit_interval = 1000 transform_mode = vectorized buffer_mb = 64 Value: 0.5 RSM Model (linear, R² = 0.7107, Adj R² = -0.0124): Coefficients: intercept +2.5625 batch_size +0.0375 writer_threads -0.9625 commit_interval +1.1875 transform_mode +0.9625 buffer_mb -0.0375 Predicted optimum (from linear model, at observed points): batch_size = 1000 writer_threads = 1 commit_interval = 50000 transform_mode = vectorized buffer_mb = 64 Predicted value: 5.6750 Surface optimum (via L-BFGS-B, linear model): batch_size = 1000 writer_threads = 16 commit_interval = 1000 transform_mode = row buffer_mb = 512 Predicted value: -0.6250 Model quality: Good fit — general trends are captured, some noise remains. Factor importance: 1. commit_interval (effect: 2.4, contribution: 37.3%) 2. transform_mode (effect: 1.9, contribution: 30.2%) 3. writer_threads (effect: -1.9, contribution: 30.2%) 4. batch_size (effect: 0.1, contribution: 1.2%) 5. buffer_mb (effect: 0.1, contribution: 1.2%)
← Previous: Stream Processing Windowing Next: Columnar Compression →