Summary
This experiment investigates spark shuffle optimization. Central Composite design to optimize Spark shuffle partitions, buffer size, and compression for job time.
The design varies 3 factors: shuffle partitions (count), ranging from 50 to 500, shuffle buffer kb (KB), ranging from 32 to 256, and compress codec, ranging from lz4 to zstd. The goal is to optimize 2 responses: job time min (min) (minimize) and shuffle spill gb (GB) (minimize). Fixed conditions held constant across all runs include executor memory = 8g, executor cores = 4.
A full factorial design was used to explore all 8 possible combinations of the 3 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (8 runs).
Key Findings
For job time min, the most influential factors were compress codec (64.2%), shuffle buffer kb (32.2%), shuffle partitions (3.6%). The best observed value was 22.9 (at shuffle partitions = 500, shuffle buffer kb = 32, compress codec = lz4).
For shuffle spill gb, the most influential factors were shuffle partitions (51.5%), compress codec (39.8%), shuffle buffer kb (8.7%). The best observed value was 7.3 (at shuffle partitions = 500, shuffle buffer kb = 32, compress codec = zstd).
Recommended Next Steps
- Consider whether any fixed factors should be varied in a future study.
Experimental Setup
Factors
| Factor | Low | High | Unit |
shuffle_partitions | 50 | 500 | count |
shuffle_buffer_kb | 32 | 256 | KB |
compress_codec | lz4 | zstd | |
Fixed: executor_memory = 8g, executor_cores = 4
Responses
| Response | Direction | Unit |
job_time_min | ↓ minimize | min |
shuffle_spill_gb | ↓ minimize | GB |
Configuration
{
"metadata": {
"name": "Spark Shuffle Optimization",
"description": "Central Composite design to optimize Spark shuffle partitions, buffer size, and compression for job time"
},
"factors": [
{
"name": "shuffle_partitions",
"levels": [
"50",
"500"
],
"type": "continuous",
"unit": "count"
},
{
"name": "shuffle_buffer_kb",
"levels": [
"32",
"256"
],
"type": "continuous",
"unit": "KB"
},
{
"name": "compress_codec",
"levels": [
"lz4",
"zstd"
],
"type": "categorical",
"unit": ""
}
],
"fixed_factors": {
"executor_memory": "8g",
"executor_cores": "4"
},
"responses": [
{
"name": "job_time_min",
"optimize": "minimize",
"unit": "min"
},
{
"name": "shuffle_spill_gb",
"optimize": "minimize",
"unit": "GB"
}
],
"settings": {
"operation": "full_factorial",
"test_script": "use_cases/37_spark_shuffle_optimization/sim.sh"
}
}
Experimental Matrix
The Full Factorial Design produces 8 runs. Each row is one experiment with specific factor settings.
| Run | shuffle_partitions | shuffle_buffer_kb | compress_codec |
| 1 | 50 | 256 | zstd |
| 2 | 500 | 32 | lz4 |
| 3 | 500 | 256 | lz4 |
| 4 | 500 | 256 | zstd |
| 5 | 50 | 256 | lz4 |
| 6 | 500 | 32 | zstd |
| 7 | 50 | 32 | lz4 |
| 8 | 50 | 32 | zstd |
Step-by-Step Workflow
1
Preview the design
$ doe info --config use_cases/37_spark_shuffle_optimization/config.json
2
Generate the runner script
$ doe generate --config use_cases/37_spark_shuffle_optimization/config.json \
--output use_cases/37_spark_shuffle_optimization/results/run.sh --seed 42
3
Execute the experiments
$ bash use_cases/37_spark_shuffle_optimization/results/run.sh
4
Analyze results
$ doe analyze --config use_cases/37_spark_shuffle_optimization/config.json
5
Get optimization recommendations
$ doe optimize --config use_cases/37_spark_shuffle_optimization/config.json
6
Multi-objective optimization
With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.
$ doe optimize --config use_cases/37_spark_shuffle_optimization/config.json --multi
7
Generate the HTML report
$ doe report --config use_cases/37_spark_shuffle_optimization/config.json \
--output use_cases/37_spark_shuffle_optimization/results/report.html
Features Exercised
| Feature | Value |
| Design type | full_factorial |
| Factor types | continuous (2), categorical (1) |
| Arg style | double-dash |
| Responses | 2 (job_time_min ↓, shuffle_spill_gb ↓) |
| Total runs | 8 |
Analysis Results
Generated from actual experiment runs using the DOE Helper Tool.
Response: job_time_min
Top factors: compress_codec (64.2%), shuffle_buffer_kb (32.2%), shuffle_partitions (3.6%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| Source | DF | SS | MS | F | p-value |
| shuffle_partitions | 1 | 0.6613 | 0.6613 | 0.017 | 0.9186 |
| shuffle_buffer_kb | 1 | 51.5113 | 51.5113 | 1.286 | 0.4601 |
| compress_codec | 1 | 205.0312 | 205.0312 | 5.119 | 0.2649 |
| shuffle_partitions*shuffle_buffer_kb | 1 | 0.6612 | 0.6612 | 0.017 | 0.9186 |
| shuffle_partitions*compress_codec | 1 | 4.0613 | 4.0613 | 0.101 | 0.8037 |
| shuffle_buffer_kb*compress_codec | 1 | 57.7812 | 57.7812 | 1.443 | 0.4420 |
| Error | 1 | 40.0513 | 40.0513 | | |
| Total | 7 | 359.7587 | 51.3941 | | |
Pareto Chart
Main Effects Plot
Normal Probability Plot of Effects
Half-Normal Plot of Effects
Model Diagnostics
Response: shuffle_spill_gb
Top factors: shuffle_partitions (51.5%), compress_codec (39.8%), shuffle_buffer_kb (8.7%).
ANOVA
| Source | DF | SS | MS | F | p-value |
| Source | DF | SS | MS | F | p-value |
| shuffle_partitions | 1 | 3.5113 | 3.5113 | 0.009 | 0.9394 |
| shuffle_buffer_kb | 1 | 0.1013 | 0.1013 | 0.000 | 0.9897 |
| compress_codec | 1 | 2.1012 | 2.1012 | 0.005 | 0.9531 |
| shuffle_partitions*shuffle_buffer_kb | 1 | 0.6612 | 0.6612 | 0.002 | 0.9736 |
| shuffle_partitions*compress_codec | 1 | 63.2813 | 63.2813 | 0.164 | 0.7548 |
| shuffle_buffer_kb*compress_codec | 1 | 91.8012 | 91.8012 | 0.238 | 0.7108 |
| Error | 1 | 385.0312 | 385.0312 | | |
| Total | 7 | 546.4887 | 78.0698 | | |
Pareto Chart
Main Effects Plot
Normal Probability Plot of Effects
Half-Normal Plot of Effects
Model Diagnostics
Response Surface Plots
3D surfaces fitted with quadratic RSM. Red dots are observed data points.
job time min shuffle partitions vs shuffle buffer kb
shuffle spill gb shuffle partitions vs shuffle buffer kb
Multi-Objective Optimization
When responses compete, Derringer–Suich desirability finds the best compromise.
Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.
Overall Desirability
D = 0.8947
Per-Response Desirability
| Response | Weight | Desirability | Predicted | Dir |
job_time_min |
1.5 |
|
25.20 0.8568 25.20 min |
↓ |
shuffle_spill_gb |
1.0 |
|
7.30 0.9545 7.30 GB |
↓ |
Recommended Settings
| Factor | Value |
shuffle_partitions | 500 count |
shuffle_buffer_kb | 32 KB |
compress_codec | lz4 |
Source: from observed run #4
Trade-off Summary
Sacrifice = how much worse than single-objective best.
| Response | Predicted | Best Observed | Sacrifice |
shuffle_spill_gb | 7.30 | 7.30 | +0.00 |
Top 3 Runs by Desirability
| Run | D | Factor Settings |
| #6 | 0.7811 | shuffle_partitions=50, shuffle_buffer_kb=256, compress_codec=lz4 |
| #3 | 0.7251 | shuffle_partitions=50, shuffle_buffer_kb=32, compress_codec=lz4 |
Model Quality
| Response | R² | Type |
shuffle_spill_gb | 0.7547 | linear |
Full Multi-Objective Output
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================
Overall desirability: D = 0.8947
Response Weight Desirability Predicted Direction
---------------------------------------------------------------------
job_time_min 1.5 0.8568 25.20 min ↓
shuffle_spill_gb 1.0 0.9545 7.30 GB ↓
Recommended settings:
shuffle_partitions = 500 count
shuffle_buffer_kb = 32 KB
compress_codec = lz4
(from observed run #4)
Trade-off summary:
job_time_min: 25.20 (best observed: 22.90, sacrifice: +2.30)
shuffle_spill_gb: 7.30 (best observed: 7.30, sacrifice: +0.00)
Model quality:
job_time_min: R² = 0.2419 (linear)
shuffle_spill_gb: R² = 0.7547 (linear)
Top 3 observed runs by overall desirability:
1. Run #4 (D=0.8947): shuffle_partitions=500, shuffle_buffer_kb=32, compress_codec=lz4
2. Run #6 (D=0.7811): shuffle_partitions=50, shuffle_buffer_kb=256, compress_codec=lz4
3. Run #3 (D=0.7251): shuffle_partitions=50, shuffle_buffer_kb=32, compress_codec=lz4
Full Analysis Output
=== Main Effects: job_time_min ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
compress_codec 10.1250 2.5346 64.2%
shuffle_buffer_kb 5.0750 2.5346 32.2%
shuffle_partitions -0.5750 2.5346 3.6%
=== ANOVA Table: job_time_min ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
shuffle_partitions 1 0.6613 0.6613 0.017 0.9186
shuffle_buffer_kb 1 51.5113 51.5113 1.286 0.4601
compress_codec 1 205.0312 205.0312 5.119 0.2649
shuffle_partitions*shuffle_buffer_kb 1 0.6612 0.6612 0.017 0.9186
shuffle_partitions*compress_codec 1 4.0613 4.0613 0.101 0.8037
shuffle_buffer_kb*compress_codec 1 57.7812 57.7812 1.443 0.4420
Error 1 40.0513 40.0513
Total 7 359.7587 51.3941
=== Interaction Effects: job_time_min ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
shuffle_buffer_kb compress_codec 5.3750 72.9%
shuffle_partitions compress_codec 1.4250 19.3%
shuffle_partitions shuffle_buffer_kb 0.5750 7.8%
=== Summary Statistics: job_time_min ===
shuffle_partitions:
Level N Mean Std Min Max
------------------------------------------------------------
50 4 31.3500 5.6789 25.2000 38.4000
500 4 30.7750 9.3514 22.9000 44.3000
shuffle_buffer_kb:
Level N Mean Std Min Max
------------------------------------------------------------
256 4 28.5250 3.3260 25.2000 33.0000
32 4 33.6000 9.5753 22.9000 44.3000
compress_codec:
Level N Mean Std Min Max
------------------------------------------------------------
lz4 4 26.0000 2.5364 22.9000 28.8000
zstd 4 36.1250 6.7188 28.8000 44.3000
=== Main Effects: shuffle_spill_gb ===
Factor Effect Std Error % Contribution
--------------------------------------------------------------
shuffle_partitions -1.3250 3.1239 51.5%
compress_codec 1.0250 3.1239 39.8%
shuffle_buffer_kb 0.2250 3.1239 8.7%
=== ANOVA Table: shuffle_spill_gb ===
Source DF SS MS F p-value
-----------------------------------------------------------------------------
shuffle_partitions 1 3.5113 3.5113 0.009 0.9394
shuffle_buffer_kb 1 0.1013 0.1013 0.000 0.9897
compress_codec 1 2.1012 2.1012 0.005 0.9531
shuffle_partitions*shuffle_buffer_kb 1 0.6612 0.6612 0.002 0.9736
shuffle_partitions*compress_codec 1 63.2813 63.2813 0.164 0.7548
shuffle_buffer_kb*compress_codec 1 91.8012 91.8012 0.238 0.7108
Error 1 385.0312 385.0312
Total 7 546.4887 78.0698
=== Interaction Effects: shuffle_spill_gb ===
Factor A Factor B Interaction % Contribution
------------------------------------------------------------------------
shuffle_buffer_kb compress_codec -6.7750 52.2%
shuffle_partitions compress_codec -5.6250 43.4%
shuffle_partitions shuffle_buffer_kb 0.5750 4.4%
=== Summary Statistics: shuffle_spill_gb ===
shuffle_partitions:
Level N Mean Std Min Max
------------------------------------------------------------
50 4 20.7750 12.5269 7.3000 34.6000
500 4 19.4500 4.9061 13.2000 24.9000
shuffle_buffer_kb:
Level N Mean Std Min Max
------------------------------------------------------------
256 4 20.0000 12.1751 7.3000 34.6000
32 4 20.2250 5.8220 13.6000 27.6000
compress_codec:
Level N Mean Std Min Max
------------------------------------------------------------
lz4 4 19.6000 9.0255 7.3000 27.6000
zstd 4 20.6250 10.0001 13.2000 34.6000
Optimization Recommendations
=== Optimization: job_time_min ===
Direction: minimize
Best observed run: #6
shuffle_partitions = 500
shuffle_buffer_kb = 32
compress_codec = lz4
Value: 22.9
RSM Model (linear, R² = 0.5692, Adj R² = 0.2461):
Coefficients:
intercept +31.0625
shuffle_partitions +1.6375
shuffle_buffer_kb +4.6375
compress_codec -1.1875
Predicted optimum (from linear model, at observed points):
shuffle_partitions = 500
shuffle_buffer_kb = 256
compress_codec = lz4
Predicted value: 38.5250
Surface optimum (via L-BFGS-B, linear model):
shuffle_partitions = 50
shuffle_buffer_kb = 32
compress_codec = zstd
Predicted value: 23.6000
Model quality: Moderate fit — use predictions directionally, not precisely.
Factor importance:
1. shuffle_buffer_kb (effect: -9.3, contribution: 62.1%)
2. shuffle_partitions (effect: 3.3, contribution: 21.9%)
3. compress_codec (effect: -2.4, contribution: 15.9%)
=== Optimization: shuffle_spill_gb ===
Direction: minimize
Best observed run: #4
shuffle_partitions = 500
shuffle_buffer_kb = 32
compress_codec = zstd
Value: 7.3
RSM Model (linear, R² = 0.5790, Adj R² = 0.2632):
Coefficients:
intercept +20.1125
shuffle_partitions -4.9625
shuffle_buffer_kb +3.4375
compress_codec -1.7625
Predicted optimum (from linear model, at observed points):
shuffle_partitions = 50
shuffle_buffer_kb = 256
compress_codec = lz4
Predicted value: 30.2750
Surface optimum (via L-BFGS-B, linear model):
shuffle_partitions = 500
shuffle_buffer_kb = 32
compress_codec = zstd
Predicted value: 9.9500
Model quality: Moderate fit — use predictions directionally, not precisely.
Factor importance:
1. shuffle_partitions (effect: -9.9, contribution: 48.8%)
2. shuffle_buffer_kb (effect: -6.9, contribution: 33.8%)
3. compress_codec (effect: -3.5, contribution: 17.3%)