Summary
This experiment investigates etl batch size tuning. Fractional factorial of 5 ETL pipeline parameters for throughput and resource efficiency.
The design varies 5 factors: batch size (rows), ranging from 1000 to 100000, writer threads (threads), ranging from 1 to 16, commit interval (rows), ranging from 1000 to 50000, transform mode, ranging from row to vectorized, and buffer mb (MB), ranging from 64 to 512. The goal is to optimize 2 responses: rows per sec (rows/s) (maximize) and peak memory gb (GB) (minimize). Fixed conditions held constant across all runs include source = postgresql, sink = s3_parquet.
A fractional factorial design reduces the number of runs from 32 to 8 by deliberately confounding higher-order interactions. This is ideal for screening — identifying which of the 5 factors matter most before investing in a full study.
Key Findings
For rows per sec, the most influential factors were commit interval (37.6%), transform mode (35.0%), writer threads (14.9%). The best observed value was 106862.0 (at batch size = 1000, writer threads = 1, commit interval = 50000).
For peak memory gb, the most influential factors were writer threads (41.2%), buffer mb (28.3%), commit interval (20.9%). The best observed value was 0.5 (at batch size = 100000, writer threads = 16, commit interval = 1000).
Recommended Next Steps
- Follow up with a response surface design (CCD or Box-Behnken) on the top 3–4 factors to model curvature and find the true optimum.
- Consider whether any fixed factors should be varied in a future study.
- The screening results can guide factor reduction — drop factors contributing less than 5% and re-run with a smaller, more focused design.
Experimental Setup
Factors
| Factor | Low | High | Unit |
batch_size | 1000 | 100000 | rows |
writer_threads | 1 | 16 | threads |
commit_interval | 1000 | 50000 | rows |
transform_mode | row | vectorized | |
buffer_mb | 64 | 512 | MB |
Fixed: source = postgresql, sink = s3_parquet
Responses
| Response | Direction | Unit |
rows_per_sec | ↑ maximize | rows/s |
peak_memory_gb | ↓ minimize | GB |
Configuration
{
"metadata": {
"name": "ETL Batch Size Tuning",
"description": "Fractional factorial of 5 ETL pipeline parameters for throughput and resource efficiency"
},
"factors": [
{
"name": "batch_size",
"levels": [
"1000",
"100000"
],
"type": "continuous",
"unit": "rows"
},
{
"name": "writer_threads",
"levels": [
"1",
"16"
],
"type": "continuous",
"unit": "threads"
},
{
"name": "commit_interval",
"levels": [
"1000",
"50000"
],
"type": "continuous",
"unit": "rows"
},
{
"name": "transform_mode",
"levels": [
"row",
"vectorized"
],
"type": "categorical",
"unit": ""
},
{
"name": "buffer_mb",
"levels": [
"64",
"512"
],
"type": "continuous",
"unit": "MB"
}
],
"fixed_factors": {
"source": "postgresql",
"sink": "s3_parquet"
},
"responses": [
{
"name": "rows_per_sec",
"optimize": "maximize",
"unit": "rows/s"
},
{
"name": "peak_memory_gb",
"optimize": "minimize",
"unit": "GB"
}
],
"settings": {
"operation": "fractional_factorial",
"test_script": "use_cases/40_etl_batch_size_tuning/sim.sh"
}
}
Experimental Matrix
The Fractional Factorial Design produces 8 runs. Each row is one experiment with specific factor settings.
| Run | batch_size | writer_threads | commit_interval | transform_mode | buffer_mb |
| 1 | 1000 | 16 | 50000 | row | 64 |
| 2 | 100000 | 1 | 1000 | row | 64 |
| 3 | 100000 | 16 | 1000 | vectorized | 64 |
| 4 | 100000 | 16 | 50000 | vectorized | 512 |
| 5 | 1000 | 16 | 1000 | row | 512 |
| 6 | 100000 | 1 | 50000 | row | 512 |
| 7 | 1000 | 1 | 1000 | vectorized | 512 |
| 8 | 1000 | 1 | 50000 | vectorized | 64 |
Step-by-Step Workflow
1
Preview the design
$ doe info --config use_cases/40_etl_batch_size_tuning/config.json
2
Generate the runner script
$ doe generate --config use_cases/40_etl_batch_size_tuning/config.json \
--output use_cases/40_etl_batch_size_tuning/results/run.sh --seed 42
3
Execute the experiments
$ bash use_cases/40_etl_batch_size_tuning/results/run.sh
4
Analyze results
$ doe analyze --config use_cases/40_etl_batch_size_tuning/config.json
5
Get optimization recommendations
$ doe optimize --config use_cases/40_etl_batch_size_tuning/config.json
6
Multi-objective optimization
With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.
$ doe optimize --config use_cases/40_etl_batch_size_tuning/config.json --multi
7
Generate the HTML report
$ doe report --config use_cases/40_etl_batch_size_tuning/config.json \
--output use_cases/40_etl_batch_size_tuning/results/report.html
Features Exercised
| Feature | Value |
| Design type | fractional_factorial |
| Factor types | continuous (4), categorical (1) |
| Arg style | double-dash |
| Responses | 2 (rows_per_sec ↑, peak_memory_gb ↓) |
| Total runs | 8 |
Analysis Results
Generated from actual experiment runs using the DOE Helper Tool.
Response: rows_per_sec
Top factors: commit_interval (37.6%), transform_mode (35.0%), writer_threads (14.9%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| Source | DF | SS | MS | F | p-value |
| batch_size | 1 | 4283664.5000 | 4283664.5000 | 0.030 | 0.8689 |
| writer_threads | 1 | 300835920.5000 | 300835920.5000 | 2.120 | 0.2051 |
| commit_interval | 1 | 1908816684.5000 | 1908816684.5000 | 13.454 | 0.0145 |
| transform_mode | 1 | 1656001250.0000 | 1656001250.0000 | 11.672 | 0.0189 |
| buffer_mb | 1 | 154528200.0000 | 154528200.0000 | 1.089 | 0.3445 |
| batch_size*writer_threads | 1 | 1656001250.0000 | 1656001250.0000 | 11.672 | 0.0189 |
| batch_size*commit_interval | 1 | 154528200.0000 | 154528200.0000 | 1.089 | 0.3445 |
| batch_size*transform_mode | 1 | 300835920.5000 | 300835920.5000 | 2.120 | 0.2051 |
| batch_size*buffer_mb | 1 | 1908816684.5000 | 1908816684.5000 | 13.454 | 0.0145 |
| writer_threads*commit_interval | 1 | 94586258.0000 | 94586258.0000 | 0.667 | 0.4513 |
| writer_threads*transform_mode | 1 | 4283664.5000 | 4283664.5000 | 0.030 | 0.8689 |
| writer_threads*buffer_mb | 1 | 3685025400.5000 | 3685025400.5000 | 25.973 | 0.0038 |
| commit_interval*transform_mode | 1 | 3685025400.5000 | 3685025400.5000 | 25.973 | 0.0038 |
| commit_interval*buffer_mb | 1 | 4283664.5000 | 4283664.5000 | 0.030 | 0.8689 |
| transform_mode*buffer_mb | 1 | 94586258.0000 | 94586258.0000 | 0.667 | 0.4513 |
| Error | (Lenth | PSE) | 5 | 709396935.0000 | 141879387.0000 |
| Total | 7 | 7804077378.0000 | 1114868196.8571 | | |
Pareto Chart
Main Effects Plot
Normal Probability Plot of Effects
Half-Normal Plot of Effects
Model Diagnostics
Response: peak_memory_gb
Top factors: writer_threads (41.2%), buffer_mb (28.3%), commit_interval (20.9%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| Source | DF | SS | MS | F | p-value |
| batch_size | 1 | 0.0113 | 0.0113 | 0.005 | 0.9469 |
| writer_threads | 1 | 7.4113 | 7.4113 | 3.227 | 0.1324 |
| commit_interval | 1 | 1.9012 | 1.9012 | 0.828 | 0.4046 |
| transform_mode | 1 | 0.2812 | 0.2812 | 0.122 | 0.7407 |
| buffer_mb | 1 | 3.5113 | 3.5113 | 1.529 | 0.2712 |
| batch_size*writer_threads | 1 | 0.2812 | 0.2812 | 0.122 | 0.7407 |
| batch_size*commit_interval | 1 | 3.5113 | 3.5113 | 1.529 | 0.2712 |
| batch_size*transform_mode | 1 | 7.4112 | 7.4112 | 3.227 | 0.1324 |
| batch_size*buffer_mb | 1 | 1.9012 | 1.9012 | 0.828 | 0.4046 |
| writer_threads*commit_interval | 1 | 1.5313 | 1.5313 | 0.667 | 0.4513 |
| writer_threads*transform_mode | 1 | 0.0112 | 0.0112 | 0.005 | 0.9469 |
| writer_threads*buffer_mb | 1 | 22.1112 | 22.1112 | 9.627 | 0.0268 |
| commit_interval*transform_mode | 1 | 22.1113 | 22.1113 | 9.627 | 0.0268 |
| commit_interval*buffer_mb | 1 | 0.0113 | 0.0113 | 0.005 | 0.9469 |
| transform_mode*buffer_mb | 1 | 1.5313 | 1.5313 | 0.667 | 0.4513 |
| Error | (Lenth | PSE) | 5 | 11.4844 | 2.2969 |
| Total | 7 | 36.7587 | 5.2512 | | |
Pareto Chart
Main Effects Plot
Normal Probability Plot of Effects
Half-Normal Plot of Effects
Model Diagnostics
Response Surface Plots
3D surfaces fitted with quadratic RSM. Red dots are observed data points.
peak memory gb batch size vs buffer mb
peak memory gb batch size vs commit interval
peak memory gb batch size vs writer threads
peak memory gb commit interval vs buffer mb
peak memory gb writer threads vs buffer mb
peak memory gb writer threads vs commit interval
rows per sec batch size vs buffer mb
rows per sec batch size vs commit interval
rows per sec batch size vs writer threads
rows per sec commit interval vs buffer mb
rows per sec writer threads vs buffer mb
rows per sec writer threads vs commit interval
Multi-Objective Optimization
When responses compete, Derringer–Suich desirability finds the best compromise.
Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
Overall Desirability
D = 0.6664
Per-Response Desirability
| Response | Weight | Desirability | Predicted | Dir |
rows_per_sec |
1.5 |
|
80394.00 0.6897 80394.00 rows/s |
↑ |
peak_memory_gb |
1.0 |
|
2.80 0.6329 2.80 GB |
↓ |
Recommended Settings
| Factor | Value |
batch_size | 1000 rows |
writer_threads | 1 threads |
commit_interval | 1000 rows |
transform_mode | vectorized |
buffer_mb | 512 MB |
Source: from observed run #3
Trade-off Summary
Sacrifice = how much worse than single-objective best.
| Response | Predicted | Best Observed | Sacrifice |
peak_memory_gb | 2.80 | 0.50 | +2.30 |
Top 3 Runs by Desirability
| Run | D | Factor Settings |
| #6 | 0.3273 | batch_size=100000, writer_threads=1, commit_interval=1000, transform_mode=row, buffer_mb=64 |
| #7 | 0.3023 | batch_size=1000, writer_threads=16, commit_interval=1000, transform_mode=row, buffer_mb=512 |
Model Quality
| Response | R² | Type |
peak_memory_gb | 0.9956 | linear |
Full Multi-Objective Output
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================
Overall desirability: D = 0.6664
Response Weight Desirability Predicted Direction
---------------------------------------------------------------------
rows_per_sec 1.5 0.6897 80394.00 rows/s ↑
peak_memory_gb 1.0 0.6329 2.80 GB ↓
Recommended settings:
batch_size = 1000 rows
writer_threads = 1 threads
commit_interval = 1000 rows
transform_mode = vectorized
buffer_mb = 512 MB
(from observed run #3)
Trade-off summary:
rows_per_sec: 80394.00 (best observed: 106862.00, sacrifice: +26468.00)
peak_memory_gb: 2.80 (best observed: 0.50, sacrifice: +2.30)
Model quality:
rows_per_sec: R² = 0.8304 (linear)
peak_memory_gb: R² = 0.9956 (linear)
Top 3 observed runs by overall desirability:
1. Run #3 (D=0.6664): batch_size=1000, writer_threads=1, commit_interval=1000, transform_mode=vectorized, buffer_mb=512
2. Run #6 (D=0.3273): batch_size=100000, writer_threads=1, commit_interval=1000, transform_mode=row, buffer_mb=64
3. Run #7 (D=0.3023): batch_size=1000, writer_threads=16, commit_interval=1000, transform_mode=row, buffer_mb=512
Full Analysis Output
=== Main Effects: rows_per_sec ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
commit_interval 30893.5000 11805.0212 37.6%
transform_mode 28775.0000 11805.0212 35.0%
writer_threads 12264.5000 11805.0212 14.9%
buffer_mb -8790.0000 11805.0212 10.7%
batch_size -1463.5000 11805.0212 1.8%
=== ANOVA Table: rows_per_sec ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
batch_size 1 4283664.5000 4283664.5000 0.030 0.8689
writer_threads 1 300835920.5000 300835920.5000 2.120 0.2051
commit_interval 1 1908816684.5000 1908816684.5000 13.454 0.0145
transform_mode 1 1656001250.0000 1656001250.0000 11.672 0.0189
buffer_mb 1 154528200.0000 154528200.0000 1.089 0.3445
batch_size*writer_threads 1 1656001250.0000 1656001250.0000 11.672 0.0189
batch_size*commit_interval 1 154528200.0000 154528200.0000 1.089 0.3445
batch_size*transform_mode 1 300835920.5000 300835920.5000 2.120 0.2051
batch_size*buffer_mb 1 1908816684.5000 1908816684.5000 13.454 0.0145
writer_threads*commit_interval 1 94586258.0000 94586258.0000 0.667 0.4513
writer_threads*transform_mode 1 4283664.5000 4283664.5000 0.030 0.8689
writer_threads*buffer_mb 1 3685025400.5000 3685025400.5000 25.973 0.0038
commit_interval*transform_mode 1 3685025400.5000 3685025400.5000 25.973 0.0038
commit_interval*buffer_mb 1 4283664.5000 4283664.5000 0.030 0.8689
transform_mode*buffer_mb 1 94586258.0000 94586258.0000 0.667 0.4513
Error (Lenth PSE) 5 709396935.0000 141879387.0000
Total 7 7804077378.0000 1114868196.8571
Note: Error estimated using Lenth's pseudo-standard-error (unreplicated design)
=== Interaction Effects: rows_per_sec ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
writer_threads buffer_mb -42924.5000 23.4%
commit_interval transform_mode 42924.5000 23.4%
batch_size buffer_mb -30893.5000 16.9%
batch_size writer_threads 28775.0000 15.7%
batch_size transform_mode 12264.5000 6.7%
batch_size commit_interval 8790.0000 4.8%
writer_threads commit_interval 6877.0000 3.8%
transform_mode buffer_mb -6877.0000 3.8%
writer_threads transform_mode -1463.5000 0.8%
commit_interval buffer_mb 1463.5000 0.8%
=== Summary Statistics: rows_per_sec ===
batch_size:
Level N Mean Std Min Max
------------------------------------------------------------
1000 4 43063.2500 26208.0368 22243.0000 80394.0000
100000 4 41599.7500 43738.6562 16021.0000 106862.0000
writer_threads:
Level N Mean Std Min Max
------------------------------------------------------------
1 4 36199.2500 29756.3620 16021.0000 80394.0000
16 4 48463.7500 40195.0172 17377.0000 106862.0000
commit_interval:
Level N Mean Std Min Max
------------------------------------------------------------
1000 4 26884.7500 10557.2687 17377.0000 41780.0000
50000 4 57778.2500 43053.8149 16021.0000 106862.0000
transform_mode:
Level N Mean Std Min Max
------------------------------------------------------------
row 4 27944.0000 10596.5799 16021.0000 41780.0000
vectorized 4 56719.0000 44012.1711 17377.0000 106862.0000
buffer_mb:
Level N Mean Std Min Max
------------------------------------------------------------
512 4 46726.5000 41565.2729 16021.0000 106862.0000
64 4 37936.5000 28673.6433 17377.0000 80394.0000
=== Main Effects: peak_memory_gb ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
writer_threads 1.9250 0.8102 41.2%
buffer_mb -1.3250 0.8102 28.3%
commit_interval 0.9750 0.8102 20.9%
transform_mode 0.3750 0.8102 8.0%
batch_size 0.0750 0.8102 1.6%
=== ANOVA Table: peak_memory_gb ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
batch_size 1 0.0113 0.0113 0.005 0.9469
writer_threads 1 7.4113 7.4113 3.227 0.1324
commit_interval 1 1.9012 1.9012 0.828 0.4046
transform_mode 1 0.2812 0.2812 0.122 0.7407
buffer_mb 1 3.5113 3.5113 1.529 0.2712
batch_size*writer_threads 1 0.2812 0.2812 0.122 0.7407
batch_size*commit_interval 1 3.5113 3.5113 1.529 0.2712
batch_size*transform_mode 1 7.4112 7.4112 3.227 0.1324
batch_size*buffer_mb 1 1.9012 1.9012 0.828 0.4046
writer_threads*commit_interval 1 1.5313 1.5313 0.667 0.4513
writer_threads*transform_mode 1 0.0112 0.0112 0.005 0.9469
writer_threads*buffer_mb 1 22.1112 22.1112 9.627 0.0268
commit_interval*transform_mode 1 22.1113 22.1113 9.627 0.0268
commit_interval*buffer_mb 1 0.0113 0.0113 0.005 0.9469
transform_mode*buffer_mb 1 1.5313 1.5313 0.667 0.4513
Error (Lenth PSE) 5 11.4844 2.2969
Total 7 36.7587 5.2512
Note: Error estimated using Lenth's pseudo-standard-error (unreplicated design)
=== Interaction Effects: peak_memory_gb ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
writer_threads buffer_mb -3.3250 25.3%
commit_interval transform_mode 3.3250 25.3%
batch_size transform_mode 1.9250 14.6%
batch_size commit_interval 1.3250 10.1%
batch_size buffer_mb -0.9750 7.4%
writer_threads commit_interval 0.8750 6.7%
transform_mode buffer_mb -0.8750 6.7%
batch_size writer_threads 0.3750 2.9%
writer_threads transform_mode 0.0750 0.6%
commit_interval buffer_mb -0.0750 0.6%
=== Summary Statistics: peak_memory_gb ===
batch_size:
Level N Mean Std Min Max
------------------------------------------------------------
1000 4 2.5250 1.6860 0.7000 4.7000
100000 4 2.6000 3.0670 0.5000 7.0000
writer_threads:
Level N Mean Std Min Max
------------------------------------------------------------
1 4 1.6000 1.1690 0.5000 2.8000
16 4 3.5250 2.9010 0.5000 7.0000
commit_interval:
Level N Mean Std Min Max
------------------------------------------------------------
1000 4 2.0750 1.9466 0.5000 4.7000
50000 4 3.0500 2.7982 0.5000 7.0000
transform_mode:
Level N Mean Std Min Max
------------------------------------------------------------
row 4 2.3750 1.7462 0.5000 4.7000
vectorized 4 2.7500 3.0183 0.5000 7.0000
buffer_mb:
Level N Mean Std Min Max
------------------------------------------------------------
512 4 3.2250 3.1742 0.5000 7.0000
64 4 1.9000 1.0033 0.5000 2.8000
Optimization Recommendations
=== Optimization: rows_per_sec ===
Direction: maximize
Best observed run: #4
batch_size = 1000
writer_threads = 1
commit_interval = 50000
transform_mode = vectorized
buffer_mb = 64
Value: 106862.0
RSM Model (linear, R² = 0.7485, Adj R² = 0.1199):
Coefficients:
intercept +42331.5000
batch_size -1248.0000
writer_threads -17976.2500
commit_interval +19271.7500
transform_mode +5793.2500
buffer_mb +731.7500
Predicted optimum (from linear model, at observed points):
batch_size = 1000
writer_threads = 1
commit_interval = 50000
transform_mode = vectorized
buffer_mb = 64
Predicted value: 85889.0000
Surface optimum (via L-BFGS-B, linear model):
batch_size = 1000
writer_threads = 1
commit_interval = 50000
transform_mode = vectorized
buffer_mb = 512
Predicted value: 87352.5000
Model quality: Good fit — general trends are captured, some noise remains.
Factor importance:
1. commit_interval (effect: 38543.5, contribution: 42.8%)
2. writer_threads (effect: -35952.5, contribution: 39.9%)
3. transform_mode (effect: 11586.5, contribution: 12.9%)
4. batch_size (effect: -2496.0, contribution: 2.8%)
5. buffer_mb (effect: -1463.5, contribution: 1.6%)
=== Optimization: peak_memory_gb ===
Direction: minimize
Best observed run: #1
batch_size = 100000
writer_threads = 16
commit_interval = 1000
transform_mode = vectorized
buffer_mb = 64
Value: 0.5
RSM Model (linear, R² = 0.7107, Adj R² = -0.0124):
Coefficients:
intercept +2.5625
batch_size +0.0375
writer_threads -0.9625
commit_interval +1.1875
transform_mode +0.9625
buffer_mb -0.0375
Predicted optimum (from linear model, at observed points):
batch_size = 1000
writer_threads = 1
commit_interval = 50000
transform_mode = vectorized
buffer_mb = 64
Predicted value: 5.6750
Surface optimum (via L-BFGS-B, linear model):
batch_size = 1000
writer_threads = 16
commit_interval = 1000
transform_mode = row
buffer_mb = 512
Predicted value: -0.6250
Model quality: Good fit — general trends are captured, some noise remains.
Factor importance:
1. commit_interval (effect: 2.4, contribution: 37.3%)
2. transform_mode (effect: 1.9, contribution: 30.2%)
3. writer_threads (effect: -1.9, contribution: 30.2%)
4. batch_size (effect: 0.1, contribution: 1.2%)
5. buffer_mb (effect: 0.1, contribution: 1.2%)