← All Use Cases
📊
Full Factorial Design

Spark Shuffle Optimization

Central Composite design to optimize Spark shuffle partitions, buffer size, and compression for job time

Summary

This experiment investigates spark shuffle optimization. Central Composite design to optimize Spark shuffle partitions, buffer size, and compression for job time.

The design varies 3 factors: shuffle partitions (count), ranging from 50 to 500, shuffle buffer kb (KB), ranging from 32 to 256, and compress codec, ranging from lz4 to zstd. The goal is to optimize 2 responses: job time min (min) (minimize) and shuffle spill gb (GB) (minimize). Fixed conditions held constant across all runs include executor memory = 8g, executor cores = 4.

A full factorial design was used to explore all 8 possible combinations of the 3 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (8 runs).

Key Findings

For job time min, the most influential factors were compress codec (64.2%), shuffle buffer kb (32.2%), shuffle partitions (3.6%). The best observed value was 22.9 (at shuffle partitions = 500, shuffle buffer kb = 32, compress codec = lz4).

For shuffle spill gb, the most influential factors were shuffle partitions (51.5%), compress codec (39.8%), shuffle buffer kb (8.7%). The best observed value was 7.3 (at shuffle partitions = 500, shuffle buffer kb = 32, compress codec = zstd).

Recommended Next Steps

Experimental Setup

Factors

FactorLowHighUnit
shuffle_partitions50500count
shuffle_buffer_kb32256KB
compress_codeclz4zstd

Fixed: executor_memory = 8g, executor_cores = 4

Responses

ResponseDirectionUnit
job_time_min↓ minimizemin
shuffle_spill_gb↓ minimizeGB

Configuration

use_cases/37_spark_shuffle_optimization/config.json
{ "metadata": { "name": "Spark Shuffle Optimization", "description": "Central Composite design to optimize Spark shuffle partitions, buffer size, and compression for job time" }, "factors": [ { "name": "shuffle_partitions", "levels": [ "50", "500" ], "type": "continuous", "unit": "count" }, { "name": "shuffle_buffer_kb", "levels": [ "32", "256" ], "type": "continuous", "unit": "KB" }, { "name": "compress_codec", "levels": [ "lz4", "zstd" ], "type": "categorical", "unit": "" } ], "fixed_factors": { "executor_memory": "8g", "executor_cores": "4" }, "responses": [ { "name": "job_time_min", "optimize": "minimize", "unit": "min" }, { "name": "shuffle_spill_gb", "optimize": "minimize", "unit": "GB" } ], "settings": { "operation": "full_factorial", "test_script": "use_cases/37_spark_shuffle_optimization/sim.sh" } }

Experimental Matrix

The Full Factorial Design produces 8 runs. Each row is one experiment with specific factor settings.

Runshuffle_partitionsshuffle_buffer_kbcompress_codec
150256zstd
250032lz4
3500256lz4
4500256zstd
550256lz4
650032zstd
75032lz4
85032zstd

Step-by-Step Workflow

1

Preview the design

Terminal
$ doe info --config use_cases/37_spark_shuffle_optimization/config.json
2

Generate the runner script

Terminal
$ doe generate --config use_cases/37_spark_shuffle_optimization/config.json \ --output use_cases/37_spark_shuffle_optimization/results/run.sh --seed 42
3

Execute the experiments

Terminal
$ bash use_cases/37_spark_shuffle_optimization/results/run.sh
4

Analyze results

Terminal
$ doe analyze --config use_cases/37_spark_shuffle_optimization/config.json
5

Get optimization recommendations

Terminal
$ doe optimize --config use_cases/37_spark_shuffle_optimization/config.json
6

Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

Terminal
$ doe optimize --config use_cases/37_spark_shuffle_optimization/config.json --multi
7

Generate the HTML report

Terminal
$ doe report --config use_cases/37_spark_shuffle_optimization/config.json \ --output use_cases/37_spark_shuffle_optimization/results/report.html

Features Exercised

FeatureValue
Design typefull_factorial
Factor typescontinuous (2), categorical (1)
Arg styledouble-dash
Responses2 (job_time_min ↓, shuffle_spill_gb ↓)
Total runs8

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: job_time_min

Top factors: compress_codec (64.2%), shuffle_buffer_kb (32.2%), shuffle_partitions (3.6%).

ANOVA

SourceDFSSMSFp-value
SourceDFSSMSFp-value
shuffle_partitions10.66130.66130.0170.9186
shuffle_buffer_kb151.511351.51131.2860.4601
compress_codec1205.0312205.03125.1190.2649
shuffle_partitions*shuffle_buffer_kb10.66120.66120.0170.9186
shuffle_partitions*compress_codec14.06134.06130.1010.8037
shuffle_buffer_kb*compress_codec157.781257.78121.4430.4420
Error140.051340.0513
Total7359.758751.3941

Pareto Chart

Pareto chart for job_time_min

Main Effects Plot

Main effects plot for job_time_min

Normal Probability Plot of Effects

Normal probability plot for job_time_min

Half-Normal Plot of Effects

Half-normal plot for job_time_min

Model Diagnostics

Model diagnostics for job_time_min

Response: shuffle_spill_gb

Top factors: shuffle_partitions (51.5%), compress_codec (39.8%), shuffle_buffer_kb (8.7%).

ANOVA

SourceDFSSMSFp-value
SourceDFSSMSFp-value
shuffle_partitions13.51133.51130.0090.9394
shuffle_buffer_kb10.10130.10130.0000.9897
compress_codec12.10122.10120.0050.9531
shuffle_partitions*shuffle_buffer_kb10.66120.66120.0020.9736
shuffle_partitions*compress_codec163.281363.28130.1640.7548
shuffle_buffer_kb*compress_codec191.801291.80120.2380.7108
Error1385.0312385.0312
Total7546.488778.0698

Pareto Chart

Pareto chart for shuffle_spill_gb

Main Effects Plot

Main effects plot for shuffle_spill_gb

Normal Probability Plot of Effects

Normal probability plot for shuffle_spill_gb

Half-Normal Plot of Effects

Half-normal plot for shuffle_spill_gb

Model Diagnostics

Model diagnostics for shuffle_spill_gb

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

job time min shuffle partitions vs shuffle buffer kb

RSM surface: job time min shuffle partitions vs shuffle buffer kb

shuffle spill gb shuffle partitions vs shuffle buffer kb

RSM surface: shuffle spill gb shuffle partitions vs shuffle buffer kb

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability
D = 0.8947

Per-Response Desirability

ResponseWeightDesirabilityPredictedDir
job_time_min 1.5
0.8568
25.20 0.8568 25.20 min
shuffle_spill_gb 1.0
0.9545
7.30 0.9545 7.30 GB

Recommended Settings

FactorValue
shuffle_partitions500 count
shuffle_buffer_kb32 KB
compress_codeclz4

Source: from observed run #4

Trade-off Summary

Sacrifice = how much worse than single-objective best.

ResponsePredictedBest ObservedSacrifice
shuffle_spill_gb7.307.30+0.00

Top 3 Runs by Desirability

RunDFactor Settings
#60.7811shuffle_partitions=50, shuffle_buffer_kb=256, compress_codec=lz4
#30.7251shuffle_partitions=50, shuffle_buffer_kb=32, compress_codec=lz4

Model Quality

ResponseType
shuffle_spill_gb0.7547linear

Full Multi-Objective Output

doe optimize --multi
============================================================ MULTI-OBJECTIVE OPTIMIZATION Method: Derringer-Suich Desirability Function ============================================================ Overall desirability: D = 0.8947 Response Weight Desirability Predicted Direction --------------------------------------------------------------------- job_time_min 1.5 0.8568 25.20 min ↓ shuffle_spill_gb 1.0 0.9545 7.30 GB ↓ Recommended settings: shuffle_partitions = 500 count shuffle_buffer_kb = 32 KB compress_codec = lz4 (from observed run #4) Trade-off summary: job_time_min: 25.20 (best observed: 22.90, sacrifice: +2.30) shuffle_spill_gb: 7.30 (best observed: 7.30, sacrifice: +0.00) Model quality: job_time_min: R² = 0.2419 (linear) shuffle_spill_gb: R² = 0.7547 (linear) Top 3 observed runs by overall desirability: 1. Run #4 (D=0.8947): shuffle_partitions=500, shuffle_buffer_kb=32, compress_codec=lz4 2. Run #6 (D=0.7811): shuffle_partitions=50, shuffle_buffer_kb=256, compress_codec=lz4 3. Run #3 (D=0.7251): shuffle_partitions=50, shuffle_buffer_kb=32, compress_codec=lz4

Full Analysis Output

doe analyze
=== Main Effects: job_time_min === Factor Effect Std Error % Contribution -------------------------------------------------------------- compress_codec 10.1250 2.5346 64.2% shuffle_buffer_kb 5.0750 2.5346 32.2% shuffle_partitions -0.5750 2.5346 3.6% === ANOVA Table: job_time_min === Source DF SS MS F p-value ----------------------------------------------------------------------------- shuffle_partitions 1 0.6613 0.6613 0.017 0.9186 shuffle_buffer_kb 1 51.5113 51.5113 1.286 0.4601 compress_codec 1 205.0312 205.0312 5.119 0.2649 shuffle_partitions*shuffle_buffer_kb 1 0.6612 0.6612 0.017 0.9186 shuffle_partitions*compress_codec 1 4.0613 4.0613 0.101 0.8037 shuffle_buffer_kb*compress_codec 1 57.7812 57.7812 1.443 0.4420 Error 1 40.0513 40.0513 Total 7 359.7587 51.3941 === Interaction Effects: job_time_min === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ shuffle_buffer_kb compress_codec 5.3750 72.9% shuffle_partitions compress_codec 1.4250 19.3% shuffle_partitions shuffle_buffer_kb 0.5750 7.8% === Summary Statistics: job_time_min === shuffle_partitions: Level N Mean Std Min Max ------------------------------------------------------------ 50 4 31.3500 5.6789 25.2000 38.4000 500 4 30.7750 9.3514 22.9000 44.3000 shuffle_buffer_kb: Level N Mean Std Min Max ------------------------------------------------------------ 256 4 28.5250 3.3260 25.2000 33.0000 32 4 33.6000 9.5753 22.9000 44.3000 compress_codec: Level N Mean Std Min Max ------------------------------------------------------------ lz4 4 26.0000 2.5364 22.9000 28.8000 zstd 4 36.1250 6.7188 28.8000 44.3000 === Main Effects: shuffle_spill_gb === Factor Effect Std Error % Contribution -------------------------------------------------------------- shuffle_partitions -1.3250 3.1239 51.5% compress_codec 1.0250 3.1239 39.8% shuffle_buffer_kb 0.2250 3.1239 8.7% === ANOVA Table: shuffle_spill_gb === Source DF SS MS F p-value ----------------------------------------------------------------------------- shuffle_partitions 1 3.5113 3.5113 0.009 0.9394 shuffle_buffer_kb 1 0.1013 0.1013 0.000 0.9897 compress_codec 1 2.1012 2.1012 0.005 0.9531 shuffle_partitions*shuffle_buffer_kb 1 0.6612 0.6612 0.002 0.9736 shuffle_partitions*compress_codec 1 63.2813 63.2813 0.164 0.7548 shuffle_buffer_kb*compress_codec 1 91.8012 91.8012 0.238 0.7108 Error 1 385.0312 385.0312 Total 7 546.4887 78.0698 === Interaction Effects: shuffle_spill_gb === Factor A Factor B Interaction % Contribution ------------------------------------------------------------------------ shuffle_buffer_kb compress_codec -6.7750 52.2% shuffle_partitions compress_codec -5.6250 43.4% shuffle_partitions shuffle_buffer_kb 0.5750 4.4% === Summary Statistics: shuffle_spill_gb === shuffle_partitions: Level N Mean Std Min Max ------------------------------------------------------------ 50 4 20.7750 12.5269 7.3000 34.6000 500 4 19.4500 4.9061 13.2000 24.9000 shuffle_buffer_kb: Level N Mean Std Min Max ------------------------------------------------------------ 256 4 20.0000 12.1751 7.3000 34.6000 32 4 20.2250 5.8220 13.6000 27.6000 compress_codec: Level N Mean Std Min Max ------------------------------------------------------------ lz4 4 19.6000 9.0255 7.3000 27.6000 zstd 4 20.6250 10.0001 13.2000 34.6000

Optimization Recommendations

doe optimize
=== Optimization: job_time_min === Direction: minimize Best observed run: #6 shuffle_partitions = 500 shuffle_buffer_kb = 32 compress_codec = lz4 Value: 22.9 RSM Model (linear, R² = 0.5692, Adj R² = 0.2461): Coefficients: intercept +31.0625 shuffle_partitions +1.6375 shuffle_buffer_kb +4.6375 compress_codec -1.1875 Predicted optimum (from linear model, at observed points): shuffle_partitions = 500 shuffle_buffer_kb = 256 compress_codec = lz4 Predicted value: 38.5250 Surface optimum (via L-BFGS-B, linear model): shuffle_partitions = 50 shuffle_buffer_kb = 32 compress_codec = zstd Predicted value: 23.6000 Model quality: Moderate fit — use predictions directionally, not precisely. Factor importance: 1. shuffle_buffer_kb (effect: -9.3, contribution: 62.1%) 2. shuffle_partitions (effect: 3.3, contribution: 21.9%) 3. compress_codec (effect: -2.4, contribution: 15.9%) === Optimization: shuffle_spill_gb === Direction: minimize Best observed run: #4 shuffle_partitions = 500 shuffle_buffer_kb = 32 compress_codec = zstd Value: 7.3 RSM Model (linear, R² = 0.5790, Adj R² = 0.2632): Coefficients: intercept +20.1125 shuffle_partitions -4.9625 shuffle_buffer_kb +3.4375 compress_codec -1.7625 Predicted optimum (from linear model, at observed points): shuffle_partitions = 50 shuffle_buffer_kb = 256 compress_codec = lz4 Predicted value: 30.2750 Surface optimum (via L-BFGS-B, linear model): shuffle_partitions = 500 shuffle_buffer_kb = 32 compress_codec = zstd Predicted value: 9.9500 Model quality: Moderate fit — use predictions directionally, not precisely. Factor importance: 1. shuffle_partitions (effect: -9.9, contribution: 48.8%) 2. shuffle_buffer_kb (effect: -6.9, contribution: 33.8%) 3. compress_codec (effect: -3.5, contribution: 17.3%)
← Previous: Message Queue Consumer Tuning Next: Data Lake Partitioning →