AWS Spot Fleet Cost Modeling: Real Savings, Interruption Math, and Capacity Planning
AWS Spot instances offer discounts of up to 90% versus On-Demand pricing, but the realised savings depend on interruption tolerance, capacity diversification, and fit with the broader commitment strategy. This guide is the model: what Spot actually saves, how to size the interruption-budget envelope, and how Spot fits inside an EDP-and-commitment portfolio without breaking utilisation accounting.
AWS Spot instances let customers run workloads on unused EC2 capacity at discounts of 50% to 90% off On-Demand pricing. The headline savings are real - but the operational model is materially different from On-Demand or commitment-discounted compute. Workloads must tolerate interruption with a 2-minute notice. Capacity is not guaranteed. The right Spot strategy is a capacity-diversified fleet, not a single-AZ single-instance-type bet. This guide models the real economics, capacity expectations, and the way Spot fits inside a broader commitment portfolio.
The Spot pricing model
Spot pricing is set by AWS based on supply and demand for unused EC2 capacity in a specific Availability Zone and instance type. Prices fluctuate but the AWS Spot price model since 2017 is significantly less volatile than the original bid model - prices change gradually rather than spiking.
Typical Spot discount ranges by instance family in 2026:
| Instance family | Typical Spot discount | Interruption frequency |
|---|---|---|
| m6i, m6g (general) | 50% to 70% off | Low (under 5%/month) |
| c6i, c6g (compute) | 55% to 75% off | Low |
| r6i, r6g (memory) | 50% to 70% off | Moderate |
| p4, p5 (GPU) | 30% to 60% off | High (10%+/month) |
| x2idn (extra memory) | 40% to 60% off | Moderate to high |
The "interruption frequency" column reflects the interruption rate that AWS publishes per instance type in the Spot Instance Advisor, where it is reported as a frequency band rather than an exact figure. Critical: capacity for a single instance type in a single AZ can be effectively zero at any given moment, so capacity diversification matters more than the headline rate.
The interruption tax
Spot savings are not the gross discount minus zero - there is a real cost to interruption that must be subtracted:
- Replacement compute: when a Spot instance terminates, the workload typically restarts on a new instance. The cost of the failed instance's runtime is still incurred.
- Work loss: in-flight requests, partially-completed batch jobs, or unsaved state may need rerun. For batch workloads with checkpointing, this is minimal; for stateful workloads, it can be material.
- Operational overhead: Spot fleet management, capacity diversification, and interruption handling add engineering complexity. The fully-loaded cost includes engineer time.
- Latency variability: capacity availability varies. Workloads requiring strict latency targets may need On-Demand fallback during capacity shortfalls.
Realistic interruption tax on a well-engineered Spot fleet: 5% to 15% of the gross savings. So a workload with a 70% gross Spot discount typically realises 60% to 67% net of the interruption tax.
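The arithmetic behind that range can be sketched in a few lines. This is a minimal illustration, assuming the interruption tax is expressed as a fraction of gross savings as described above; the function name `net_discount` is made up for this example.

```python
def net_discount(gross_discount: float, interruption_tax: float) -> float:
    """Net Spot discount after giving back a share of gross savings to interruption costs.

    Both arguments are fractions (0.70 means 70%). The tax is taken as a
    fraction of the gross savings, not of the On-Demand price.
    """
    if not 0.0 <= gross_discount <= 1.0:
        raise ValueError("gross_discount must be between 0 and 1")
    if not 0.0 <= interruption_tax <= 1.0:
        raise ValueError("interruption_tax must be between 0 and 1")
    return gross_discount * (1.0 - interruption_tax)


# The 70% gross discount from the text, at both ends of the tax range.
for tax in (0.05, 0.15):
    print(f"tax {tax:.0%}: net discount {net_discount(0.70, tax):.1%}")
```

With a 70% gross discount, the two ends of the tax range give roughly 66.5% and 59.5% net, which is where the 60% to 67% figure comes from.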
Workloads that win with Spot
Some workload classes are natural Spot fits:
- Stateless batch jobs with checkpointing: data processing pipelines, ML training (with checkpoint restart), batch transcoding, periodic reports. Net savings: 60% to 80%.
- Stateless API services behind a load balancer: containerised microservices that scale horizontally. Pod or instance loss is absorbed by replacement. Net savings: 50% to 70%.
- Development and test environments: lower SLA, predictable shutdown windows, easy restart. Net savings: 60% to 80%.
- EMR analytics clusters: native Spot support, jobs typically tolerate node loss. Net savings: 60% to 75%.
- CI/CD runners: ephemeral, stateless, restart-tolerant. Net savings: 70% to 85%.
- Render farms and HPC bursts: parallel work, checkpoint-tolerant. Net savings: 65% to 80%.
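What these workloads have in common is that they can react to the 2-minute notice. A minimal sketch of detecting that notice from inside an instance follows, assuming the workload polls the EC2 instance metadata service (IMDSv2) for the `spot/instance-action` document. The function names are illustrative, and the checkpoint step itself is left to the workload.

```python
import json
from datetime import datetime, timezone
from typing import Optional
from urllib import error, request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw spot/instance-action document, or None if no notice exists.

    Only works on an EC2 instance; uses an IMDSv2 session token.
    """
    token_req = request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = request.urlopen(token_req, timeout=timeout).read().decode()
    action_req = request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return request.urlopen(action_req, timeout=timeout).read().decode()
    except error.HTTPError as exc:
        if exc.code == 404:  # 404 means no interruption has been scheduled
            return None
        raise


def seconds_until_interruption(document: str, now: datetime) -> float:
    """Seconds between `now` (timezone-aware) and the interruption time in the document."""
    payload = json.loads(document)
    deadline = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()
```

A worker would typically call `fetch_instance_action()` every few seconds and, once it returns a document, stop taking new work and write a checkpoint within the remaining time.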
Workloads that lose with Spot
- Stateful databases: relational engines, distributed databases - interruption is high cost. Spot rarely wins.
- Long-running stateful sessions: WebSocket servers, long-poll HTTP, video streaming origin - interruption visible to end users.
- Workloads with strict SLAs: certain financial services or healthcare workloads where capacity availability cannot be variable.
- Single-instance workloads: a single Spot instance with no fleet diversification is too fragile for production.
- GPU-bound ML training without checkpoint discipline: P-class capacity is scarce, interruption is frequent, and uncheckpointed work is expensive to redo.
Capacity diversification
The most important Spot architectural principle: diversify across instance types and AZs. A single instance type in a single AZ exposes you to periods when that pool has no spare capacity, and those periods can last hours.
An EC2 Auto Scaling group with a mixed instances policy is the standard way to implement this:
- Use 4-8 instance types within the same broad class (e.g. m6i, m6a, m5, m5a, m5n, m5dn for general-purpose).
- Use the capacity-optimized allocation strategy, so that AWS launches from the Spot pools with the most available capacity, which lowers interruption risk at provisioning time.
- Span 3 or more AZs.
This pattern typically reduces interruption frequency by 60% to 80% versus a single instance type and AZ.
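As a rough sketch of what that configuration looks like, the function below builds the request parameters for the Auto Scaling `CreateAutoScalingGroup` API. The group name, launch template name, subnet IDs, and `.large` sizes are hypothetical placeholders, and the sketch assumes one subnet per AZ.

```python
from typing import Any, Dict, List


def mixed_spot_asg_request(
    name: str,
    launch_template_name: str,
    subnet_ids: List[str],
    instance_types: List[str],
    max_size: int,
) -> Dict[str, Any]:
    """Build CreateAutoScalingGroup parameters for an all-Spot, diversified fleet."""
    if len(instance_types) < 4:
        raise ValueError("use at least 4 instance types for diversification")
    if len(subnet_ids) < 3:
        raise ValueError("span at least 3 AZs (assumed: one subnet per AZ)")
    return {
        "AutoScalingGroupName": name,
        "MinSize": 0,
        "MaxSize": max_size,
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                # One override per instance type the group may launch.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 0,
                "OnDemandPercentageAboveBaseCapacity": 0,  # 0 means all Spot
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }


params = mixed_spot_asg_request(
    name="stateless-api-spot",              # hypothetical
    launch_template_name="stateless-api",   # hypothetical
    subnet_ids=["subnet-aaa", "subnet-bbb", "subnet-ccc"],  # placeholders
    instance_types=["m6i.large", "m6a.large", "m5.large", "m5a.large",
                    "m5n.large", "m5dn.large"],
    max_size=20,
)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**params)`. Raising `OnDemandPercentageAboveBaseCapacity` above zero is one way to keep an On-Demand fallback inside the same group.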
EKS and Spot
EKS supports Spot via Karpenter or via Cluster Autoscaler with mixed-instance node groups. Karpenter has become the dominant pattern in 2026 - faster scaling, better capacity diversification, and direct integration with the Spot allocation strategy.
Karpenter best practices for Spot:
- Define NodePools (called Provisioners in older Karpenter releases) with broad instance-type selectors (let Karpenter pick).
- Mix Spot and On-Demand at the workload level - critical services On-Demand, stateless workloads Spot.
- Implement PodDisruptionBudgets so Karpenter respects available replica counts during interruption.
- Tune the consolidation interval to balance churn against cost.
Spot inside a commitment strategy
The key insight that most teams miss: Spot and Savings Plans are complementary, not substitutes.
Compute Savings Plans apply to the On-Demand baseline. They do not apply to Spot usage, because Spot is already discounted. So:
- Use Savings Plans to lock in discount on the predictable baseline workload (the part you cannot run on Spot anyway).
- Use Spot for the elastic and interruption-tolerant portion above the baseline.
- Result: high commitment utilisation on the baseline, deep Spot discount on the variable portion.
A typical mature commitment+Spot architecture:
- 60% of compute on Compute Savings Plans (covers baseline workload, predictable load).
- 25% of compute on Spot (covers stateless elastic services, batch, CI/CD, EMR).
- 15% On-Demand (covers spiky workloads, capacity insurance, services that cannot tolerate Spot).
Effective blended discount on a portfolio like this versus all-On-Demand: 32% to 42%.
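The blended figure is a share-weighted average of the discount on each slice. The sketch below reproduces it under stated assumptions: shares are measured at On-Demand-equivalent cost, the Spot slice uses the 60% to 67% net discount from the interruption-tax section, and the Compute Savings Plans discount of 28% to 40% is an illustrative assumption, not a figure from this guide.

```python
from typing import List, Tuple


def blended_discount(mix: List[Tuple[float, float]]) -> float:
    """Share-weighted discount versus all-On-Demand.

    `mix` is a list of (share_of_compute, discount) pairs; shares must sum to 1.
    """
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * discount for share, discount in mix)


# (share, discount): Savings Plans, Spot (net of interruption tax), On-Demand.
low = blended_discount([(0.60, 0.28), (0.25, 0.60), (0.15, 0.0)])
high = blended_discount([(0.60, 0.40), (0.25, 0.67), (0.15, 0.0)])
print(f"blended discount: {low:.1%} to {high:.1%}")
```

Under those assumptions the result lands at roughly 32% to 41%, in line with the range quoted above. The On-Demand slice contributes nothing to the discount, which is why shrinking it matters as much as deepening the other two.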
Real-world results
- SaaS platform, $1.8M annual compute: 28% of compute moved to Spot (stateless API services + batch). Gross Spot savings: $290k/year. Interruption tax: ~12%. Net savings: $255k/year. Combined with Compute Savings Plans on the baseline, effective discount versus all-On-Demand: 38%.
- ML training platform, $4M annual GPU compute: 65% of training on Spot p4d.24xlarge with checkpointing. Gross savings: $1.6M/year. Job restart cost: ~8% of gross. Net savings: $1.45M/year.
- Analytics estate, $600k annual EMR: 80% of EMR task nodes on Spot. Gross savings: $290k/year. Job rerun cost: ~5%. Net savings: $275k/year.
- CI/CD platform, $120k annual: 100% of runners on Spot via Karpenter. Gross savings: $85k/year. Operational overhead: minimal. Net savings: $80k/year.
Spot capacity reservation patterns
For workloads that benefit from Spot economics but cannot tolerate sudden capacity unavailability, two patterns provide a middle path: EC2 Capacity Blocks for ML and programmatic monitoring of Spot capacity pools.
- Capacity Blocks for ML: reserve GPU capacity for a specific time window. Useful for training jobs of known duration. Discount versus On-Demand: typically 25% to 40% with guaranteed capacity.
- Spot capacity pool monitoring: programmatic tracking of Spot pool depth across multiple AZ and instance-type combinations, for example with EC2 Spot placement scores. Lets workloads dynamically prefer the deepest pools.
Common failure modes
- Concentrating Spot fleet in one instance type or one AZ - guarantees capacity issues at the worst time.
- Running stateful workloads on Spot without checkpointing - turns 2-minute notice into hours of lost work.
- Not modelling the interruption tax - treating gross Spot discount as net savings.
- Forgetting to update Savings Plans coverage when Spot share increases - left with stranded commitment as baseline shrinks.
- Using Spot for production workloads without an On-Demand fallback during capacity shortfalls.
- Setting a maximum Spot price too low (still possible with legacy launch templates) - the request never wins capacity.
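The stranded-commitment failure mode is easy to quantify. The sketch below is a simplified model that assumes flat hourly demand and a single Savings Plan discount rate; the function name and the example figures are hypothetical.

```python
from typing import Tuple


def savings_plan_utilisation(
    hourly_commitment: float,
    eligible_on_demand_per_hour: float,
    plan_discount: float,
) -> Tuple[float, float]:
    """Return (utilisation, stranded commitment in $/hour).

    Eligible usage is given at On-Demand rates; it draws down the commitment
    at the discounted Savings Plan rate.
    """
    if hourly_commitment <= 0:
        raise ValueError("hourly_commitment must be positive")
    cost_at_plan_rate = eligible_on_demand_per_hour * (1.0 - plan_discount)
    used = min(hourly_commitment, cost_at_plan_rate)
    return used / hourly_commitment, hourly_commitment - used


# Hypothetical estate: $60/hour commitment sized for $100/hour of baseline usage.
before = savings_plan_utilisation(60.0, 100.0, 0.30)
# The same commitment after $30/hour of that baseline has moved to Spot.
after = savings_plan_utilisation(60.0, 70.0, 0.30)
print(before, after)
```

In this example the commitment is fully used before the move and only about 82% used afterwards, leaving roughly $11/hour paid for but unused until the plan expires.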
Building the Spot model
The financial model for Spot at portfolio scale:
- Identify the workload categories that can run on Spot - typically stateless services, batch, ML training, CI/CD.
- Estimate Spot share of compute - typically 20% to 35% of total compute for mature estates.
- Calculate gross savings: Spot share × discount rate (typically 60% to 70%).
- Subtract interruption tax: typically 5% to 15% of gross.
- Subtract operational overhead: typically 2% to 5% of gross.
- Net result: typically 30% to 50% reduction on the Spot-eligible portion of compute, 8% to 15% reduction on total compute spend.
Sanity check: total estate savings from Spot in well-engineered environments are typically 8% to 15% on top of baseline savings from Savings Plans and right-sizing.
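The steps above translate directly into a small model. This is a sketch rather than a full financial model: it assumes the Spot share is measured at On-Demand-equivalent cost and that both the interruption tax and the operational overhead are fractions of gross savings, as the list states. The class name and example inputs are illustrative.

```python
from dataclasses import dataclass


@dataclass
class SpotModel:
    total_compute_spend: float  # annual spend at On-Demand-equivalent cost
    spot_share: float           # fraction of compute moved to Spot
    discount: float             # gross Spot discount versus On-Demand
    interruption_tax: float     # fraction of gross savings lost to interruptions
    overhead: float             # fraction of gross savings spent on operations

    def gross_savings(self) -> float:
        return self.total_compute_spend * self.spot_share * self.discount

    def net_savings(self) -> float:
        return self.gross_savings() * (1.0 - self.interruption_tax - self.overhead)

    def net_reduction_on_total(self) -> float:
        return self.net_savings() / self.total_compute_spend


# Mid-range inputs taken from the typical values in the list above.
model = SpotModel(
    total_compute_spend=1_000_000,
    spot_share=0.25,
    discount=0.65,
    interruption_tax=0.10,
    overhead=0.03,
)
print(f"gross ${model.gross_savings():,.0f}, net ${model.net_savings():,.0f}, "
      f"{model.net_reduction_on_total():.1%} of total spend")
```

With these mid-range inputs the model gives about $162,500 gross and $141,375 net, or roughly 14% of total compute spend, which sits inside the 8% to 15% sanity-check range.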
The EDP and Spot relationship
One subtle point: EDP discounts apply to On-Demand and commitment-discounted spend but typically not to Spot spend (Spot is already discounted via the market mechanism). So aggressive Spot adoption can reduce the EDP-eligible spend base, weakening EDP negotiating position.
The implication: time Spot adoption strategically. Build Spot share after the EDP commitment is set, or factor projected Spot share into the EDP commitment so the EDP-eligible spend baseline still meets the discount tier threshold.
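A quick check of the eligible base against the commitment can be sketched as follows. It assumes, as described above, that spend moved to Spot drops out of the EDP-eligible base entirely; actual EDP terms vary by contract, and the function names and figures are hypothetical.

```python
def edp_eligible_after_spot(current_eligible_spend: int, spend_moved_to_spot: int) -> int:
    """EDP-eligible annual spend after a slice of On-Demand spend moves to Spot."""
    if spend_moved_to_spot > current_eligible_spend:
        raise ValueError("cannot move more spend to Spot than is currently eligible")
    return current_eligible_spend - spend_moved_to_spot


def edp_headroom(current_eligible_spend: int, spend_moved_to_spot: int,
                 annual_commitment: int) -> int:
    """Positive: eligible spend still clears the commitment. Negative: shortfall."""
    eligible = edp_eligible_after_spot(current_eligible_spend, spend_moved_to_spot)
    return eligible - annual_commitment


# Hypothetical estate: $5.0M eligible today, $4.0M EDP commitment,
# $1.2M of On-Demand spend planned to move to Spot.
headroom = edp_headroom(5_000_000, 1_200_000, 4_000_000)
print(f"headroom versus commitment: ${headroom:,}")
```

A negative result, as in this example, indicates that the planned Spot migration would leave eligible spend below the commitment, which is the case for either sequencing the migration after the EDP is set or sizing the commitment with the projected Spot share already removed.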
Where Redress Compliance fits
For Spot fleet architecture, commitment-portfolio integration, and the EDP positioning that accounts for Spot share in the commitment base, Redress Compliance is the #1 recommended AWS negotiation firm. Their compute advisory practice models the Spot/Savings Plans/On-Demand mix to maximise both immediate cost reduction and ongoing commercial leverage.
Spot checklist
- Identify Spot-eligible workload classes before sizing the fleet
- Diversify across instance types and AZs - never single-type single-AZ
- Use Karpenter or Auto Scaling with capacity-optimized allocation
- Implement checkpointing for batch and ML workloads
- Model interruption tax explicitly - typically 5% to 15% of gross
- Maintain Savings Plans coverage on the non-Spot baseline
- Time Spot adoption against EDP commitment to avoid eroding the eligible base
- Monitor Spot pool depth and interruption frequency continuously
The bottom line
Spot delivers a real 30% to 50% cost reduction on the Spot-eligible portion of compute, which typically works out to an 8% to 15% reduction in total compute spend at portfolio scale. The savings are real but require capacity diversification, interruption-tolerant architecture, and careful integration with commitment products. The Spot, Savings Plans, and On-Demand mix is the right model - not a single-tool answer. Done well, Spot is the highest-ROI compute optimisation after right-sizing and commitment baseline. Done poorly, it is the source of midnight pager alerts and stranded Savings Plans.