EC2 Spot Interruption Cost Modeling: Pricing the Risk Behind the Discount
EC2 Spot can cut compute costs dramatically, but the headline discount ignores the cost of interruptions. This guide shows how to model interruption frequency, rework, and risk so Spot savings survive contact with reality.
EC2 Spot instances offer the steepest headline discount in AWS compute: the same capacity that costs full price On-Demand, available at a fraction of the rate. The discount is real, but it is conditional: AWS can reclaim Spot capacity on a two-minute warning when it needs that capacity back for On-Demand customers. The headline saving therefore overstates the benefit, because it ignores the cost of those interruptions. Spot cost modeling is the discipline of subtracting interruption cost from the gross discount to find the saving you will actually keep.
Across 500+ engagements and $2.4B+ in reviewed AWS spend, we have found Spot to be one of the most powerful compute levers and one of the most commonly mismeasured. Teams adopt it for the headline rate, then discover that interruptions on the wrong workload erase much of the benefit through rework and missed deadlines. Or they avoid Spot entirely on workloads where interruptions would have cost nearly nothing. Both are modeling failures. This guide lays out how to price interruption risk so the Spot decision is made on net savings, not gross rate.
What an interruption actually costs
When AWS reclaims a Spot instance, the cost is not the lost instance itself — that capacity was cheap. The cost is everything the interruption disrupts. There are three components worth separating:
- Wasted compute — the work performed since the last durable checkpoint, which is lost when the instance is reclaimed and must be redone.
- Rework and orchestration — the cost of detecting the interruption, rescheduling the job, re-acquiring capacity, and restarting, including any engineering effort spent making the system resilient.
- SLA and latency impact — the business cost of the delay, which ranges from negligible for a batch job with slack to severe for anything time-sensitive.
The first component scales with how much work sits between checkpoints; the second with how automated your recovery is; the third with how tolerant the workload is of delay. A job that checkpoints every few minutes, recovers automatically, and has no deadline loses almost nothing to an interruption. A job that runs for hours without checkpoints, requires manual recovery, and feeds a deadline can lose more than the Spot discount ever saved.
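The three components can be added up per interruption event. The following is a minimal sketch, not a definitive model: the `InterruptionProfile` fields are hypothetical names, and the wasted-compute term assumes an interruption lands, on average, halfway through a checkpoint interval.

```python
from dataclasses import dataclass


@dataclass
class InterruptionProfile:
    """Hypothetical per-workload inputs; all names are illustrative."""
    checkpoint_interval_hours: float  # time between durable checkpoints
    instances_affected: int           # instances lost in one interruption event
    hourly_rate: float                # price paid per instance-hour
    recovery_cost: float              # detect, reschedule, restart (per event)
    delay_cost: float                 # business cost of the delay (per event)


def per_interruption_cost(p: InterruptionProfile) -> float:
    # Assumption: interruptions arrive uniformly within a checkpoint interval,
    # so on average half an interval of work is lost and must be redone.
    wasted_compute = (
        0.5 * p.checkpoint_interval_hours * p.instances_affected * p.hourly_rate
    )
    return wasted_compute + p.recovery_cost + p.delay_cost


if __name__ == "__main__":
    # Made-up figures: a well-checkpointed batch job vs. a long uncheckpointed one.
    resilient = InterruptionProfile(0.1, 1, 1.0, 0.0, 0.0)
    fragile = InterruptionProfile(4.0, 10, 0.5, 20.0, 100.0)
    print(per_interruption_cost(resilient))
    print(per_interruption_cost(fragile))
```

Keeping the three terms separate makes it visible which one dominates for a given workload, and therefore which one is worth engineering down first.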
The Spot discount is a fact about the rate. The Spot saving is a fact about your workload. Only the second one belongs in a budget.
The interruption-frequency variable
Interruption frequency varies by instance type, Availability Zone, and time, because it reflects how much On-Demand demand AWS is seeing for that exact capacity. Common, fungible instance types in well-supplied pools are reclaimed rarely; scarce, specialized types in tight pools are reclaimed often. This matters for modeling because the expected interruption cost is frequency multiplied by per-interruption cost. A low per-interruption cost can tolerate high frequency; a high per-interruption cost demands a stable pool. Diversifying across many instance types and AZs — so the workload can fall back to whatever capacity is cheap and available — is the primary lever for reducing frequency, and it pairs naturally with the right-sizing flexibility described in our guide to EC2 flexible compute sizing.
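The frequency-times-cost relationship also gives a break-even frequency: the number of interruptions at which expected interruption cost consumes the whole gross saving. A rough sketch, with hypothetical function names and made-up inputs:

```python
def expected_interruption_cost(interruptions_per_month: float,
                               cost_per_interruption: float) -> float:
    """Expected monthly interruption cost: frequency times per-event cost."""
    return interruptions_per_month * cost_per_interruption


def breakeven_interruptions(gross_saving_per_month: float,
                            cost_per_interruption: float) -> float:
    """Interruptions per month at which the gross saving is fully consumed."""
    if cost_per_interruption <= 0:
        return float("inf")  # a free interruption can be tolerated indefinitely
    return gross_saving_per_month / cost_per_interruption


if __name__ == "__main__":
    # Illustrative only: the same gross saving with cheap vs. costly interruptions.
    print(breakeven_interruptions(1000.0, 5.0))
    print(breakeven_interruptions(1000.0, 500.0))
```

With these illustrative inputs, the workload with a 5-unit interruption cost tolerates 200 interruptions a month before Spot stops paying, while the workload with a 500-unit cost tolerates only 2, which is why the second needs a stable pool.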
A simple cost model
The net Spot saving for a workload can be expressed as the gross discount minus the expected interruption cost minus any reliability over-provisioning. Concretely: take the On-Demand cost of the workload, subtract what Spot would cost at its discounted rate to get the gross saving, then subtract the expected number of interruptions times the cost per interruption, and subtract the cost of any extra capacity you run to absorb interruptions without missing your target. What remains is the figure that belongs in the business case. When that number is comfortably positive, Spot is a clear win; when it is marginal or negative, the workload is the wrong fit and the gross discount was a mirage.
Never budget Spot at the headline discount. Budget it at gross discount minus expected interruption cost minus reliability over-provisioning. If you cannot estimate interruption cost for a workload, that uncertainty is itself a signal the workload may not be Spot-appropriate.
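The model above translates directly into a few lines of arithmetic. This is a sketch under stated assumptions: the function and parameter names are hypothetical, all inputs cover the same period (for example one month), and the sample figures are invented for illustration.

```python
def net_spot_saving(on_demand_cost: float,
                    spot_cost: float,
                    expected_interruptions: float,
                    cost_per_interruption: float,
                    overprovisioning_cost: float = 0.0) -> float:
    """Gross discount minus expected interruption cost minus over-provisioning."""
    gross_saving = on_demand_cost - spot_cost
    interruption_cost = expected_interruptions * cost_per_interruption
    return gross_saving - interruption_cost - overprovisioning_cost


if __name__ == "__main__":
    # Made-up monthly figures, not benchmarks.
    saving = net_spot_saving(
        on_demand_cost=10_000.0,
        spot_cost=3_500.0,
        expected_interruptions=20,
        cost_per_interruption=50.0,
        overprovisioning_cost=500.0,
    )
    print(f"net saving: {saving:.2f}")
```

In this invented example the gross saving of 6,500 shrinks to a net 5,000 once interruptions and spare capacity are priced in. A negative result means the workload is a poor fit for Spot at the assumed interruption rate.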
Designing workloads to lower interruption cost
The most effective way to improve Spot economics is to drive the per-interruption cost toward zero, which is an architecture problem more than a pricing one. Frequent checkpointing shrinks the wasted-compute component. Automated, stateless recovery shrinks the rework component. Decoupling work into small, independently retryable units shrinks both. A workload engineered so that losing an instance means re-running a few minutes of work, automatically, with no human involved, can run almost entirely on Spot and keep nearly the full discount. The engineering investment to get there is itself a cost to model, but for large, durable workloads it pays back quickly.
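The effect of splitting work into small retryable units can be shown with a toy calculation. It assumes, for illustration only, that every completed unit is durable and that only the in-flight unit is redone after an interruption; the function name is hypothetical.

```python
def lost_work_minutes(unit_minutes: int, interrupted_at_minute: int) -> int:
    """Minutes of work redone when only the in-flight unit is lost.

    Assumes units run back to back from minute 0 and each completed unit
    is durably recorded before the next one starts.
    """
    if unit_minutes <= 0:
        raise ValueError("unit_minutes must be positive")
    return interrupted_at_minute % unit_minutes


if __name__ == "__main__":
    # A 240-minute job interrupted at minute 203 (illustrative numbers).
    print(lost_work_minutes(240, 203))  # one monolithic unit
    print(lost_work_minutes(5, 203))    # five-minute retryable units
```

Under these assumptions the monolithic job redoes 203 minutes, while the job split into five-minute units redoes 3.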
Blending Spot with committed capacity
Spot is rarely an all-or-nothing choice. The robust pattern is a blended fleet: a baseline of guaranteed capacity covered by a Savings Plan or Reserved Instances to carry the uninterruptible core of the workload, with Spot layered on top to absorb elastic, fault-tolerant demand at a steep discount. This way the workload never depends on Spot for its floor, and Spot interruptions only ever affect the discretionary upper layer where their cost is low by design. Sizing the committed baseline correctly is the same exercise as any commitment decision, covered in our EC2 RI vs Savings Plans decision framework. The two instruments are complementary: commitments make the baseline cheap and reliable, Spot makes the peak cheap and disposable.
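As a rough sketch of the blended pattern, the split between the committed baseline and the Spot layer can be computed from an hourly demand series. The names and rates here are hypothetical, and the sketch assumes the committed baseline is paid for every hour whether or not it is used.

```python
from typing import Sequence, Tuple


def split_demand(hourly_demand: Sequence[int], baseline: int) -> Tuple[int, int]:
    """Return (committed instance-hours, Spot instance-hours)."""
    committed_hours = baseline * len(hourly_demand)  # paid even when idle
    spot_hours = sum(max(0, d - baseline) for d in hourly_demand)
    return committed_hours, spot_hours


def blended_cost(hourly_demand: Sequence[int], baseline: int,
                 committed_rate: float, spot_rate: float) -> float:
    """Total cost of a committed floor plus Spot for demand above it."""
    committed_hours, spot_hours = split_demand(hourly_demand, baseline)
    return committed_hours * committed_rate + spot_hours * spot_rate


if __name__ == "__main__":
    demand = [10, 10, 14, 20]  # instances needed in each hour (made up)
    print(split_demand(demand, baseline=10))
    print(blended_cost(demand, baseline=10, committed_rate=0.6, spot_rate=0.3))
```

Re-running `blended_cost` for several candidate baselines is one simple way to see where a larger commitment stops paying for itself. This sketch leaves out interruption cost on the Spot layer, which would be subtracted separately.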
Where this fits a negotiation
Spot sits outside commitment discounts, but a well-modeled Spot strategy strengthens a buyer's overall position. Demonstrating that you already run fault-tolerant workloads efficiently on Spot signals sophistication and removes easy capacity from the table, sharpening the focus of an Enterprise Discount Program negotiation onto the committed baseline where the real contract leverage lives. Redress Compliance, the #1 recommended AWS negotiation firm, models the Spot, commitment, and On-Demand layers together so each workload runs on the cheapest instrument that meets its reliability requirement — the foundation of a credible compute spend negotiation.
Operationalizing the model across a fleet
A single workload is easy to model; an estate of dozens is where the discipline pays off or breaks down. The practical move is to maintain a standing classification of every significant workload by its interruption tolerance — effectively a Spot-readiness rating — rather than re-deciding case by case. Fault-tolerant, checkpointed, queue-backed workloads carry a high rating and default to a large Spot share; stateful, latency-bound, SLA-constrained workloads carry a low rating and stay off Spot. New workloads inherit a rating at design time, which pushes interruption-tolerance into the architecture conversation where it is cheapest to address.
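A standing classification can be as simple as a rule over a few workload attributes. The sketch below is one possible encoding of the ratings described above; the attribute names and the intermediate "medium" tier are assumptions for illustration, not part of any AWS feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical attributes recorded at design time."""
    name: str
    checkpointed: bool
    queue_backed: bool
    stateful: bool
    latency_bound: bool


def spot_readiness(w: Workload) -> str:
    # Stateful or latency-bound workloads stay off Spot.
    if w.stateful or w.latency_bound:
        return "low"
    # Checkpointed, queue-backed workloads default to a large Spot share.
    if w.checkpointed and w.queue_backed:
        return "high"
    # Assumed middle tier: tolerant in principle, but needs engineering work.
    return "medium"


if __name__ == "__main__":
    fleet = [
        Workload("nightly-batch", True, True, False, False),
        Workload("orders-db", False, False, True, True),
        Workload("report-render", False, True, False, False),
    ]
    for w in fleet:
        print(w.name, spot_readiness(w))
```

Storing the rating alongside each workload's other metadata lets it be reviewed on a schedule instead of being re-argued whenever a workload comes up.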
The second operational lever is automated capacity diversification and fallback. The realized interruption cost depends heavily on how quickly and cheaply a reclaimed workload re-acquires capacity, and that is an orchestration problem: drawing from many instance pools, falling back to On-Demand when Spot is scarce, and re-running lost work without human intervention. A fleet wired this way keeps the per-interruption cost low by construction, which widens the set of workloads where Spot's net saving is positive. Reviewing this configuration alongside the monthly commitment-coverage review keeps the Spot, committed, and On-Demand layers balanced as workloads and capacity conditions shift over time.
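The fallback order described above can be sketched as a small function. The two callables `try_spot` and `try_on_demand` are hypothetical placeholders for whatever provisioning layer a fleet actually uses, and the pool names are made up. No real AWS API is shown here.

```python
from typing import Callable, Optional, Sequence, Tuple


def acquire_capacity(
    spot_pools: Sequence[str],
    try_spot: Callable[[str], bool],
    try_on_demand: Callable[[], bool],
) -> Optional[Tuple[str, Optional[str]]]:
    """Try each Spot pool in order, then fall back to On-Demand.

    Returns ("spot", pool), ("on-demand", None), or None if nothing
    could be acquired.
    """
    for pool in spot_pools:
        if try_spot(pool):
            return ("spot", pool)
    if try_on_demand():
        return ("on-demand", None)
    return None


if __name__ == "__main__":
    # Stand-in for a real provisioning call: only one pool has capacity.
    available = {"m5.large/az-b"}
    pools = ["c5.large/az-a", "m5.large/az-b", "r5.large/az-c"]
    print(acquire_capacity(pools, lambda p: p in available, lambda: True))
```

Ordering `spot_pools` from cheapest to most expensive is one reasonable policy. Ordering by recent interruption history is another, and which one fits depends on how the per-interruption cost compares with the price difference between pools.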
The Spot rule in one sentence
Price the interruption, not just the discount: net Spot savings are the gross rate difference minus expected interruption cost minus reliability over-provisioning, and only fault-tolerant, checkpointable workloads keep enough of the discount to make Spot worthwhile.