Spark Shuffle Optimization

Summary

This experiment investigates spark shuffle optimization. Central Composite design to optimize Spark shuffle partitions, buffer size, and compression for job time.

The design varies 3 factors: shuffle partitions (count), ranging from 50 to 500, shuffle buffer kb (KB), ranging from 32 to 256, and compress codec, ranging from lz4 to zstd. The goal is to optimize 2 responses: job time min (min) (minimize) and shuffle spill gb (GB) (minimize). Fixed conditions held constant across all runs include executor memory = 8g, executor cores = 4.

A full factorial design was used to explore all 8 possible combinations of the 3 factors at two levels. This guarantees that every main effect and interaction can be estimated independently, at the cost of a larger experiment (8 runs).

Key Findings

For job time min, the most influential factors were compress codec (64.2%), shuffle buffer kb (32.2%), shuffle partitions (3.6%). The best observed value was 22.9 (at shuffle partitions = 500, shuffle buffer kb = 32, compress codec = lz4).

For shuffle spill gb, the most influential factors were shuffle partitions (51.5%), compress codec (39.8%), shuffle buffer kb (8.7%). The best observed value was 7.3 (at shuffle partitions = 500, shuffle buffer kb = 32, compress codec = zstd).

Recommended Next Steps

Consider whether any fixed factors should be varied in a future study.

Experimental Setup

Factors

Factor	Low	High	Unit
`shuffle_partitions`	50	500	count
`shuffle_buffer_kb`	32	256	KB
`compress_codec`	lz4	zstd

Fixed: executor_memory = 8g, executor_cores = 4

Responses

Response	Direction	Unit
`job_time_min`	↓ minimize	min
`shuffle_spill_gb`	↓ minimize	GB

Configuration

use_cases/37_spark_shuffle_optimization/config.json

{
  "metadata": {
    "name": "Spark Shuffle Optimization",
    "description": "Central Composite design to optimize Spark shuffle partitions, buffer size, and compression for job time"
  },
  "factors": [
    {
      "name": "shuffle_partitions",
      "levels": [
        "50",
        "500"
      ],
      "type": "continuous",
      "unit": "count"
    },
    {
      "name": "shuffle_buffer_kb",
      "levels": [
        "32",
        "256"
      ],
      "type": "continuous",
      "unit": "KB"
    },
    {
      "name": "compress_codec",
      "levels": [
        "lz4",
        "zstd"
      ],
      "type": "categorical",
      "unit": ""
    }
  ],
  "fixed_factors": {
    "executor_memory": "8g",
    "executor_cores": "4"
  },
  "responses": [
    {
      "name": "job_time_min",
      "optimize": "minimize",
      "unit": "min"
    },
    {
      "name": "shuffle_spill_gb",
      "optimize": "minimize",
      "unit": "GB"
    }
  ],
  "settings": {
    "operation": "full_factorial",
    "test_script": "use_cases/37_spark_shuffle_optimization/sim.sh"
  }
}

Experimental Matrix

The Full Factorial Design produces 8 runs. Each row is one experiment with specific factor settings.

Run	`shuffle_partitions`	`shuffle_buffer_kb`	`compress_codec`
1	50	256	zstd
2	500	32	lz4
3	500	256	lz4
4	500	256	zstd
5	50	256	lz4
6	500	32	zstd
7	50	32	lz4
8	50	32	zstd

Step-by-Step Workflow

1

Preview the design

Terminal

$ doe info --config use_cases/37_spark_shuffle_optimization/config.json

2

Generate the runner script

Terminal

$ doe generate --config use_cases/37_spark_shuffle_optimization/config.json \
    --output use_cases/37_spark_shuffle_optimization/results/run.sh --seed 42

3

Execute the experiments

Terminal

$ bash use_cases/37_spark_shuffle_optimization/results/run.sh

4

Analyze results

Terminal

$ doe analyze --config use_cases/37_spark_shuffle_optimization/config.json

5

Get optimization recommendations

Terminal

$ doe optimize --config use_cases/37_spark_shuffle_optimization/config.json

6

Multi-objective optimization

With 2 competing responses, use --multi to find the best compromise via Derringer–Suich desirability.

Terminal

$ doe optimize --config use_cases/37_spark_shuffle_optimization/config.json --multi

7

Generate the HTML report

Terminal

$ doe report --config use_cases/37_spark_shuffle_optimization/config.json \
    --output use_cases/37_spark_shuffle_optimization/results/report.html

Features Exercised

Feature	Value
Design type	`full_factorial`
Factor types	`continuous` (2), `categorical` (1)
Arg style	`double-dash`
Responses	2 (job_time_min ↓, shuffle_spill_gb ↓)
Total runs	8

Analysis Results

Generated from actual experiment runs using the DOE Helper Tool.

Response: job_time_min

Top factors: compress_codec (64.2%), shuffle_buffer_kb (32.2%), shuffle_partitions (3.6%).

ANOVA

Source	DF	SS	MS	F	p-value
Source	DF	SS	MS	F	p-value
shuffle_partitions	1	0.6613	0.6613	0.017	0.9186
shuffle_buffer_kb	1	51.5113	51.5113	1.286	0.4601
compress_codec	1	205.0312	205.0312	5.119	0.2649
shuffle_partitions*shuffle_buffer_kb	1	0.6612	0.6612	0.017	0.9186
shuffle_partitions*compress_codec	1	4.0613	4.0613	0.101	0.8037
shuffle_buffer_kb*compress_codec	1	57.7812	57.7812	1.443	0.4420
Error	1	40.0513	40.0513
Total	7	359.7587	51.3941

Pareto Chart

Main Effects Plot

Normal Probability Plot of Effects

Normal probability plot for job_time_min

Half-Normal Plot of Effects

Model Diagnostics

Response: shuffle_spill_gb

Top factors: shuffle_partitions (51.5%), compress_codec (39.8%), shuffle_buffer_kb (8.7%).

ANOVA

Source	DF	SS	MS	F	p-value
Source	DF	SS	MS	F	p-value
shuffle_partitions	1	3.5113	3.5113	0.009	0.9394
shuffle_buffer_kb	1	0.1013	0.1013	0.000	0.9897
compress_codec	1	2.1012	2.1012	0.005	0.9531
shuffle_partitions*shuffle_buffer_kb	1	0.6612	0.6612	0.002	0.9736
shuffle_partitions*compress_codec	1	63.2813	63.2813	0.164	0.7548
shuffle_buffer_kb*compress_codec	1	91.8012	91.8012	0.238	0.7108
Error	1	385.0312	385.0312
Total	7	546.4887	78.0698

Pareto Chart

Main Effects Plot

Normal Probability Plot of Effects

Normal probability plot for shuffle_spill_gb

Half-Normal Plot of Effects

Model Diagnostics

Response Surface Plots

3D surfaces fitted with quadratic RSM. Red dots are observed data points.

job time min shuffle partitions vs shuffle buffer kb

shuffle spill gb shuffle partitions vs shuffle buffer kb

Multi-Objective Optimization

When responses compete, Derringer–Suich desirability finds the best compromise. Each response is scaled to a 0–1 desirability, then combined via a weighted geometric mean.

Overall Desirability

D = 0.8947

Per-Response Desirability

Response	Weight	Desirability	Predicted	Dir
`job_time_min`	1.5	0.8568	25.20 0.8568 25.20 min	↓
`shuffle_spill_gb`	1.0	0.9545	7.30 0.9545 7.30 GB	↓

Recommended Settings

Factor	Value
`shuffle_partitions`	500 count
`shuffle_buffer_kb`	32 KB
`compress_codec`	lz4

Source: from observed run #4

Trade-off Summary

Sacrifice = how much worse than single-objective best.

Response	Predicted	Best Observed	Sacrifice
`shuffle_spill_gb`	7.30	7.30	+0.00

Top 3 Runs by Desirability

Run	D	Factor Settings
#6	0.7811	shuffle_partitions=50, shuffle_buffer_kb=256, compress_codec=lz4
#3	0.7251	shuffle_partitions=50, shuffle_buffer_kb=32, compress_codec=lz4

Model Quality

Response	R²	Type
`shuffle_spill_gb`	0.7547	linear

Full Multi-Objective Output

doe optimize --multi
============================================================
MULTI-OBJECTIVE OPTIMIZATION
Method: Derringer-Suich Desirability Function
============================================================

Overall desirability: D = 0.8947

Response                  Weight Desirability    Predicted  Direction
---------------------------------------------------------------------
job_time_min                 1.5       0.8568       25.20 min   ↓
shuffle_spill_gb             1.0       0.9545        7.30 GB   ↓

Recommended settings:
  shuffle_partitions = 500 count
  shuffle_buffer_kb = 32 KB
  compress_codec = lz4
  (from observed run #4)

Trade-off summary:
  job_time_min: 25.20 (best observed: 22.90, sacrifice: +2.30)
  shuffle_spill_gb: 7.30 (best observed: 7.30, sacrifice: +0.00)

Model quality:
  job_time_min: R² = 0.2419 (linear)
  shuffle_spill_gb: R² = 0.7547 (linear)

Top 3 observed runs by overall desirability:
  1. Run #4 (D=0.8947): shuffle_partitions=500, shuffle_buffer_kb=32, compress_codec=lz4
  2. Run #6 (D=0.7811): shuffle_partitions=50, shuffle_buffer_kb=256, compress_codec=lz4
  3. Run #3 (D=0.7251): shuffle_partitions=50, shuffle_buffer_kb=32, compress_codec=lz4

Full Analysis Output

doe analyze
=== Main Effects: job_time_min ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
compress_codec          10.1250       2.5346            64.2%
shuffle_buffer_kb        5.0750       2.5346            32.2%
shuffle_partitions      -0.5750       2.5346             3.6%

=== ANOVA Table: job_time_min ===
Source                      DF           SS           MS          F    p-value
-----------------------------------------------------------------------------
shuffle_partitions           1       0.6613       0.6613      0.017     0.9186
shuffle_buffer_kb            1      51.5113      51.5113      1.286     0.4601
compress_codec               1     205.0312     205.0312      5.119     0.2649
shuffle_partitions*shuffle_buffer_kb    1       0.6612       0.6612      0.017     0.9186
shuffle_partitions*compress_codec    1       4.0613       4.0613      0.101     0.8037
shuffle_buffer_kb*compress_codec    1      57.7812      57.7812      1.443     0.4420
Error                        1      40.0513      40.0513
Total                        7     359.7587      51.3941

=== Interaction Effects: job_time_min ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
shuffle_buffer_kb    compress_codec             5.3750            72.9%
shuffle_partitions   compress_codec             1.4250            19.3%
shuffle_partitions   shuffle_buffer_kb          0.5750             7.8%

=== Summary Statistics: job_time_min ===

shuffle_partitions:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  50                  4    31.3500     5.6789    25.2000    38.4000
  500                 4    30.7750     9.3514    22.9000    44.3000

shuffle_buffer_kb:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  256                 4    28.5250     3.3260    25.2000    33.0000
  32                  4    33.6000     9.5753    22.9000    44.3000

compress_codec:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  lz4                 4    26.0000     2.5364    22.9000    28.8000
  zstd                4    36.1250     6.7188    28.8000    44.3000

=== Main Effects: shuffle_spill_gb ===
Factor                   Effect    Std Error   % Contribution
--------------------------------------------------------------
shuffle_partitions      -1.3250       3.1239            51.5%
compress_codec           1.0250       3.1239            39.8%
shuffle_buffer_kb        0.2250       3.1239             8.7%

=== ANOVA Table: shuffle_spill_gb ===
Source                      DF           SS           MS          F    p-value
-----------------------------------------------------------------------------
shuffle_partitions           1       3.5113       3.5113      0.009     0.9394
shuffle_buffer_kb            1       0.1013       0.1013      0.000     0.9897
compress_codec               1       2.1012       2.1012      0.005     0.9531
shuffle_partitions*shuffle_buffer_kb    1       0.6612       0.6612      0.002     0.9736
shuffle_partitions*compress_codec    1      63.2813      63.2813      0.164     0.7548
shuffle_buffer_kb*compress_codec    1      91.8012      91.8012      0.238     0.7108
Error                        1     385.0312     385.0312
Total                        7     546.4887      78.0698

=== Interaction Effects: shuffle_spill_gb ===
Factor A             Factor B              Interaction   % Contribution
------------------------------------------------------------------------
shuffle_buffer_kb    compress_codec            -6.7750            52.2%
shuffle_partitions   compress_codec            -5.6250            43.4%
shuffle_partitions   shuffle_buffer_kb          0.5750             4.4%

=== Summary Statistics: shuffle_spill_gb ===

shuffle_partitions:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  50                  4    20.7750    12.5269     7.3000    34.6000
  500                 4    19.4500     4.9061    13.2000    24.9000

shuffle_buffer_kb:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  256                 4    20.0000    12.1751     7.3000    34.6000
  32                  4    20.2250     5.8220    13.6000    27.6000

compress_codec:
  Level               N       Mean        Std        Min        Max
  ------------------------------------------------------------
  lz4                 4    19.6000     9.0255     7.3000    27.6000
  zstd                4    20.6250    10.0001    13.2000    34.6000

Optimization Recommendations

doe optimize
=== Optimization: job_time_min ===
Direction: minimize

Best observed run: #6
  shuffle_partitions = 500
  shuffle_buffer_kb = 32
  compress_codec = lz4
  Value: 22.9

RSM Model (linear, R² = 0.5692, Adj R² = 0.2461):
  Coefficients:
    intercept                      +31.0625
    shuffle_partitions             +1.6375
    shuffle_buffer_kb              +4.6375
    compress_codec                 -1.1875

  Predicted optimum (from linear model, at observed points):
    shuffle_partitions = 500
    shuffle_buffer_kb = 256
    compress_codec = lz4
    Predicted value: 38.5250

  Surface optimum (via L-BFGS-B, linear model):
    shuffle_partitions = 50
    shuffle_buffer_kb = 32
    compress_codec = zstd
    Predicted value: 23.6000

  Model quality: Moderate fit — use predictions directionally, not precisely.

Factor importance:
  1. shuffle_buffer_kb  (effect: -9.3, contribution: 62.1%)
  2. shuffle_partitions  (effect: 3.3, contribution: 21.9%)
  3. compress_codec  (effect: -2.4, contribution: 15.9%)

=== Optimization: shuffle_spill_gb ===
Direction: minimize

Best observed run: #4
  shuffle_partitions = 500
  shuffle_buffer_kb = 32
  compress_codec = zstd
  Value: 7.3

RSM Model (linear, R² = 0.5790, Adj R² = 0.2632):
  Coefficients:
    intercept                      +20.1125
    shuffle_partitions             -4.9625
    shuffle_buffer_kb              +3.4375
    compress_codec                 -1.7625

  Predicted optimum (from linear model, at observed points):
    shuffle_partitions = 50
    shuffle_buffer_kb = 256
    compress_codec = lz4
    Predicted value: 30.2750

  Surface optimum (via L-BFGS-B, linear model):
    shuffle_partitions = 500
    shuffle_buffer_kb = 32
    compress_codec = zstd
    Predicted value: 9.9500

  Model quality: Moderate fit — use predictions directionally, not precisely.

Factor importance:
  1. shuffle_partitions  (effect: -9.9, contribution: 48.8%)
  2. shuffle_buffer_kb  (effect: -6.9, contribution: 33.8%)
  3. compress_codec  (effect: -3.5, contribution: 17.3%)