Are SageMaker Savings Plans worth it?

For any baseline SageMaker spend, yes. Discounts run up to 64 percent for 3-year all-upfront. Size to baseline run-rate, not peak, and combine with EDP commitment for stacked savings.

Should I use SageMaker Serverless Inference or real-time?

Serverless for intermittent or low-traffic workloads (no idle cost). Real-time for sustained workloads above ~100K requests/hour. Real-time with Inference Components consolidates multiple low-traffic models.

How do I get rid of idle SageMaker endpoints?

Inventory endpoints monthly, set automatic cleanup for endpoints with no invocations in 14 days, and migrate dev/test endpoints to serverless inference, which has zero idle cost.

SageMaker Pricing Optimization: Training, Inference, and the Idle Endpoint Tax

By Compute Practice·Published February 12, 2025·Last updated December 24, 2025·10 min read

SageMaker is a family of pricing dimensions wrapped around the underlying ML instance types. Notebooks, training jobs, inference endpoints, batch transform, and feature store each bill differently. The 80/20 of SageMaker cost optimisation is killing idle endpoints and right-sizing the ones that remain.

Published May 2026 · Cluster: AI ML · 13 min read

SageMaker is rarely the cheapest way to do any single ML task. The pricing carries a managed-service premium over running the same workload on raw EC2 with custom orchestration. The argument for SageMaker is operational, not financial: less infrastructure plumbing, faster team onboarding, integrated MLOps. The argument against an unoptimised SageMaker deployment is overwhelming: every dimension of SageMaker pricing has a footgun, and the default behaviour is expensive.

This piece walks through the SageMaker pricing dimensions, the optimisation levers in order of payoff, and the EDP angles. The headline result across 500+ AI/ML engagements: SageMaker bills routinely fall 40 to 60 percent through disciplined application of these levers, without affecting model quality or developer productivity.

The seven cost lines

Studio notebooks — instance-hour while running, per-user.
Training jobs — instance-hour for the job duration.
Real-time inference endpoints — instance-hour while deployed, 24/7.
Serverless inference — per-request and per-second of compute, no idle cost.
Asynchronous inference — instance-hour but scales to zero between batches.
Batch transform — instance-hour for the batch duration.
Feature store — storage + online/offline read/write.

Two additional cost lines that hide:

SageMaker Canvas, the low-code tool, has its own pricing model, billed separately from the lines above; SageMaker Studio Lab is a separate free service.
Model Registry, Pipelines, Experiments are free at the service level but the underlying jobs and storage bill normally.

The single biggest mistake: idle endpoints

Real-time inference endpoints bill instance-hours from the moment they are deployed until they are torn down. A development endpoint that has not received a request in three months has been billing the entire time: roughly $85 per month for an idle ml.m5.large (about $0.115 per hour on-demand in us-east-1), and $700 or more per month once the instance is a larger CPU or GPU type. Multiply by the number of forgotten endpoints across a large data-science org and the SageMaker bill loses its connection to actual ML activity.

Inventory all real-time endpoints monthly.
Set automatic cleanup for endpoints with no invocations in 14 days.
Convert dev/test endpoints to serverless inference (no idle cost).
Use SageMaker Inference Components to run multiple models on a single endpoint, eliminating one-endpoint-per-model sprawl.
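
The monthly inventory and the 14-day rule above can be sketched as a small report function. This is a minimal sketch, not a definitive implementation: it assumes the two client arguments are boto3 SageMaker and CloudWatch clients (or stand-ins with the same methods), the function name `find_idle_endpoints` is hypothetical, and it only reports candidates rather than deleting anything.

```python
from datetime import datetime, timedelta, timezone


def find_idle_endpoints(sagemaker_client, cloudwatch_client, idle_days=14, now=None):
    """Return names of InService endpoints with zero invocations in the window.

    Assumption: the clients behave like boto3.client("sagemaker") and
    boto3.client("cloudwatch"). Endpoints younger than the window are skipped.
    """
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=idle_days)
    idle = []
    kwargs = {"StatusEquals": "InService"}
    while True:
        page = sagemaker_client.list_endpoints(**kwargs)
        for summary in page["Endpoints"]:
            if summary["CreationTime"] > start:
                continue  # too new to judge against the idle window
            name = summary["EndpointName"]
            detail = sagemaker_client.describe_endpoint(EndpointName=name)
            total = 0.0
            for variant in detail.get("ProductionVariants", []):
                stats = cloudwatch_client.get_metric_statistics(
                    Namespace="AWS/SageMaker",
                    MetricName="Invocations",
                    Dimensions=[
                        {"Name": "EndpointName", "Value": name},
                        {"Name": "VariantName", "Value": variant["VariantName"]},
                    ],
                    StartTime=start,
                    EndTime=now,
                    Period=86400,  # one datapoint per day
                    Statistics=["Sum"],
                )
                total += sum(point["Sum"] for point in stats["Datapoints"])
            if total == 0:
                idle.append(name)
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return idle
```

A reasonable design choice is to send the returned list to the endpoint owners for review before any teardown, since an endpoint with no traffic in 14 days may still be needed for a scheduled quarterly job.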

Right-sizing inference endpoints

Production endpoints are typically over-provisioned. The default heuristics analysts apply ("pick a larger instance to be safe") cost real money over time. Right-sizing levers:

Use CloudWatch metrics to measure actual CPU/GPU/memory utilisation.
Move to smaller instance families if utilisation is consistently below 40 percent.
Switch GPU endpoints to Inf2 (Inferentia) for compatible models; routinely 60 to 70 percent cheaper.
Enable autoscaling on real-time endpoints to scale down during off-peak.
Use Inference Components to run multiple models on shared hardware.
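
The autoscaling lever above can be sketched as the two request payloads that Application Auto Scaling expects for a SageMaker endpoint variant. This is an illustrative sketch under stated assumptions: the helper name `build_autoscaling_requests`, the policy naming scheme, and the cooldown values are hypothetical choices, and the target value should come from load testing rather than from this example.

```python
def build_autoscaling_requests(endpoint_name, variant_name, min_instances,
                               max_instances, target_invocations_per_instance):
    """Build kwargs for register_scalable_target and put_scaling_policy.

    Assumption: the variant is instance-backed, so the minimum is at least 1.
    """
    if not 1 <= min_instances <= max_instances:
        raise ValueError("need 1 <= min_instances <= max_instances")
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-{variant_name}-invocations",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # illustrative: scale in slowly
            "ScaleOutCooldown": 60,   # illustrative: scale out quickly
        },
    }
    return target, policy
```

With boto3, the two dictionaries would be passed as `client.register_scalable_target(**target)` and `client.put_scaling_policy(**policy)` on an `application-autoscaling` client.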

Inference Components: the underused lever

SageMaker Inference Components let multiple models share a single endpoint instance. The cost implication is significant: instead of one ml.g5.12xlarge per model at roughly $7/hour each (on-demand; the list price varies by region), ten models share a single ml.g5.12xlarge at roughly $7/hour total. Caveat: the models must fit in the instance's memory together, and their request patterns must be compatible.

Use Inference Components for low-traffic models that would otherwise need dedicated endpoints.
Combine with autoscaling for further optimisation.
Inference Components support both CPU and GPU instance types.
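
The "models must fit memory" caveat can be checked before consolidating. The sketch below is a rough planning aid, not how SageMaker places components: it packs models onto instances by memory alone using first-fit decreasing, and the function names, memory figures, and hourly rate are all hypothetical inputs. Request-pattern compatibility still needs separate load testing.

```python
def pack_models(model_memory_gib, instance_memory_gib):
    """First-fit decreasing: group models onto as few instances as memory allows.

    model_memory_gib maps a model name to the memory it needs (GiB).
    Returns a list of instances, each a list of model names.
    """
    instances = []  # each entry: {"free": remaining GiB, "models": [names]}
    ordered = sorted(model_memory_gib.items(), key=lambda kv: kv[1], reverse=True)
    for name, need in ordered:
        if need > instance_memory_gib:
            raise ValueError(f"{name} does not fit on one instance")
        for inst in instances:
            if inst["free"] >= need:
                inst["free"] -= need
                inst["models"].append(name)
                break
        else:
            # no existing instance has room, so open a new one
            instances.append({"free": instance_memory_gib - need, "models": [name]})
    return [inst["models"] for inst in instances]


def monthly_saving_usd(n_models, n_instances, hourly_rate_usd, hours_per_month=730):
    """Saving versus one dedicated instance per model, all on the same instance type."""
    return (n_models - n_instances) * hourly_rate_usd * hours_per_month
```

For example, `pack_models({"a": 10, "b": 10, "c": 30, "d": 20}, 40)` groups four models onto two instances, and `monthly_saving_usd(4, 2, 5.0)` gives the corresponding saving at an assumed $5/hour rate.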

Serverless inference

SageMaker Serverless Inference bills per request and per second of compute, with no idle cost. The model is excellent for:

Dev/test endpoints that are queried intermittently.
Low-traffic production models.
Workloads with very spiky traffic patterns.

Serverless inference becomes expensive above ~100K requests per hour sustained; switch to real-time + autoscaling at that point.
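
Where the crossover actually falls depends on request duration and on the prices in the target region, so it is worth computing for each model. A minimal sketch, assuming serverless cost is dominated by per-second compute and comparing against a single always-on instance; the function name and every number passed to it are placeholders, not AWS list prices.

```python
def breakeven_requests_per_hour(instance_hourly_usd, serverless_usd_per_second,
                                avg_seconds_per_request):
    """Sustained requests/hour above which one real-time instance is cheaper.

    Simplification: ignores data-processing charges, and assumes one instance
    can actually serve the break-even load.
    """
    cost_per_request = serverless_usd_per_second * avg_seconds_per_request
    if cost_per_request <= 0:
        raise ValueError("per-request cost must be positive")
    return instance_hourly_usd / cost_per_request


# Placeholder inputs for illustration only; substitute current regional prices.
threshold = breakeven_requests_per_hour(
    instance_hourly_usd=0.23,
    serverless_usd_per_second=0.00002,
    avg_seconds_per_request=0.2,
)
```

Models with longer inference times reach the break-even point at far lower request rates, which is why a single threshold should be treated as a rule of thumb.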

Asynchronous inference

Async inference scales to zero between batches and is ideal for large-payload, latency-tolerant workloads (long-running inference, video, batch image classification). It is dramatically cheaper than real-time for workloads that do not need sub-second latency.


Training optimisation

Training cost has its own levers, separate from inference:

Spot training via Managed Spot Training. Up to 90 percent off on-demand, with checkpointing for fault tolerance.
SageMaker Training Compiler reduces training time by 10 to 50 percent on supported models.
Distributed training libraries (SMDDP) scale efficiently across multiple instances.
Right-size training instance. Many teams over-provision; profile and tune.
Trainium (trn1, trn2) for transformer training; routinely 40 percent better price/performance than NVIDIA.
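
For the spot lever above, the settings that matter are the spot flag, a checkpoint location, and a wait limit that is at least as long as the run limit. The sketch below builds those fields in the shape used by the CreateTrainingJob API and computes the realised saving from the two durations a finished job reports. The helper names are hypothetical, and the remaining required fields of a training job request are omitted.

```python
def spot_training_settings(checkpoint_s3_uri, max_runtime_s, max_wait_s):
    """Return the spot-related fields to merge into a create_training_job request."""
    if not checkpoint_s3_uri.startswith("s3://"):
        raise ValueError("checkpoint location must be an s3:// URI")
    if max_wait_s < max_runtime_s:
        raise ValueError("max wait must be >= max runtime for spot training")
    return {
        "EnableManagedSpotTraining": True,
        "CheckpointConfig": {"S3Uri": checkpoint_s3_uri},
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Realised saving, from TrainingTimeInSeconds and BillableTimeInSeconds."""
    if training_seconds <= 0:
        raise ValueError("training time must be positive")
    return (1 - billable_seconds / training_seconds) * 100
```

Tracking the realised saving per job, rather than assuming the headline discount, shows which instance types and regions are actually delivering it.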

See the AI training job cost optimization piece for the full breakdown.

Notebook hygiene

SageMaker Studio notebooks bill while running. The patterns that pad the bill:

Notebooks left running overnight or over weekends.
Oversized instance types ("ml.t3.xlarge just to be safe").
Forgotten kernels in JupyterServer that keep notebooks active.

The fixes:

Lifecycle configurations that auto-shutdown notebooks after N hours of inactivity.
Default instance sizes capped at ml.t3.medium for general work.
Quarterly review of who has Studio access; remove inactive users.
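
The auto-shutdown decision itself is simple to express. The sketch below assumes the session list has the shape returned by the Jupyter Server `/api/sessions` endpoint (each session carrying a `kernel` with `execution_state` and `last_activity`); the function name is hypothetical, and fetching the sessions and stopping the instance are left to the lifecycle script that would call it.

```python
from datetime import datetime, timedelta, timezone


def is_notebook_idle(sessions, idle_hours, now=None):
    """True when no kernel is busy and none was active within `idle_hours`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=idle_hours)
    for session in sessions:
        kernel = session["kernel"]
        if kernel["execution_state"] == "busy":
            return False  # something is still running
        # Jupyter reports timestamps such as 2025-06-01T08:00:00.000000Z
        last = datetime.fromisoformat(kernel["last_activity"].replace("Z", "+00:00"))
        if last > cutoff:
            return False  # recent activity, keep the notebook up
    return True
```

Treating a busy kernel as active regardless of its last-activity timestamp avoids shutting down a long-running cell that has produced no output for several hours.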

Feature Store

SageMaker Feature Store has two cost dimensions:

Online store billed per-million reads and writes plus storage. Used for low-latency feature retrieval.
Offline store billed at S3 rates plus a small feature-store charge. Used for training-time feature retrieval.

Optimisation moves:

Only put production features in the online store; keep experimental features in offline only.
Implement TTL on online store entries to control storage growth.
Audit feature group usage; remove unused feature groups.

Worked example: $120K monthly SageMaker bill

Step	Action	Bill after
Baseline	30 endpoints (8 idle), oversized notebooks, on-demand training	$120,000/month
Step 1	Tear down idle endpoints	~$95,000/month
Step 2	Migrate dev endpoints to serverless	~$82,000/month
Step 3	Consolidate via Inference Components	~$58,000/month
Step 4	Right-size production endpoints, enable autoscaling	~$42,000/month
Step 5	Spot training for non-critical jobs	~$32,000/month
Step 6	SageMaker Savings Plans on remaining baseline	~$24,000/month

An 80 percent reduction, as in this example, is achievable on mature SageMaker estates with no compromise on developer experience or model quality, though 40 to 60 percent is the more routine result. Each step is independently safe, and the steps are ordered for risk-adjusted ROI.
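
The arithmetic behind the table can be reproduced in a few lines, which also makes it easy to substitute an organisation's own figures. The monthly amounts are the ones from the table above; the function name `step_savings` is just an illustrative choice.

```python
# (action, monthly bill after the action, in USD), taken from the table above
STEPS = [
    ("Baseline", 120_000),
    ("Tear down idle endpoints", 95_000),
    ("Migrate dev endpoints to serverless", 82_000),
    ("Consolidate via Inference Components", 58_000),
    ("Right-size production endpoints, enable autoscaling", 42_000),
    ("Spot training for non-critical jobs", 32_000),
    ("SageMaker Savings Plans on remaining baseline", 24_000),
]


def step_savings(steps):
    """Return (action, monthly saving from this step, cumulative % off baseline)."""
    baseline = steps[0][1]
    rows = []
    for (_, before), (action, after) in zip(steps, steps[1:]):
        cumulative_pct = round(100 * (baseline - after) / baseline, 1)
        rows.append((action, before - after, cumulative_pct))
    return rows


for action, saving, cumulative in step_savings(STEPS):
    print(f"{action}: -${saving:,}/month, {cumulative}% below baseline")
```

Read this way, the first three steps, all of which deal with endpoint sprawl, account for more than half of the total reduction.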

SageMaker Savings Plans

SageMaker has its own Savings Plans, separate from Compute Savings Plans. They cover SageMaker instance-hours (notebooks, training, inference) at a committed hourly rate. Discount range: up to 64 percent for 3-year all-upfront, lower for shorter terms and for partial or no upfront payment.

Size Savings Plans to the baseline SageMaker run-rate, not the peak.
Plan Compute Savings Plans separately: they cover EC2, Fargate, and Lambda and do not apply to SageMaker usage, so SageMaker instance-hours need a SageMaker Savings Plan.
Re-evaluate at each EDP cycle.
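
Sizing to the baseline rather than the peak can be made concrete with an hourly spend series. A minimal sketch, assuming the input is on-demand-equivalent SageMaker spend per hour exported from a billing report; the function name, the choice of the 10th percentile, and the discount rate are all illustrative assumptions. Because a Savings Plan commitment is expressed in discounted dollars per hour, the baseline is scaled by the plan's discount.

```python
def baseline_commitment(hourly_spend_usd, discount_rate, percentile=10):
    """Suggest an hourly commitment covering the spend level exceeded most hours.

    hourly_spend_usd: on-demand-equivalent spend for each hour in the lookback.
    discount_rate: the plan's discount as a fraction (e.g. 0.3 for 30 percent).
    """
    if not hourly_spend_usd:
        raise ValueError("need at least one hour of spend data")
    if not 0 <= discount_rate < 1:
        raise ValueError("discount_rate must be in [0, 1)")
    ordered = sorted(hourly_spend_usd)
    # Low percentile of spend approximates the always-on baseline.
    index = min(int(len(ordered) * percentile / 100), len(ordered) - 1)
    on_demand_baseline = ordered[index]
    return on_demand_baseline * (1 - discount_rate)
```

A low percentile is deliberately conservative: unused commitment is paid for regardless, so under-committing slightly and topping up later is usually the safer error.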

The EDP angle

SageMaker is bundled inside the AI/ML category of EDP commitments. The negotiation levers:

Bundle SageMaker with Bedrock, GPU compute, and supporting analytics for a blended AI/ML discount.
Negotiate SageMaker-specific concessions: free Studio licenses, discounted feature store, free model registry.
Migration credits when moving ML workloads from competitor platforms.
SageMaker Savings Plans rate enhancements inside the EDP.

See the AI/ML cost negotiation guide for the full discount stack.

Common failure modes

One-endpoint-per-model architecture

The most expensive default in SageMaker. Inference Components, multi-model endpoints, and serverless inference each address this. Pick one and consolidate.

Notebooks as production

Notebooks that have grown into production workflows continue billing as notebooks. Migrate to Processing Jobs or Pipelines for scheduled work.

Spot training for critical jobs

Spot training is excellent for fault-tolerant training but requires checkpointing. A mid-training interruption on a job without checkpoints loses all progress and inflates cost.

GPU endpoints for CPU workloads

Deploying CPU-bound models to GPU endpoints to "future-proof" is expensive. Profile first; right-size second.

Implementation checklist

Inventory all SageMaker endpoints; tear down idle ones.
Right-size production endpoints based on utilisation.
Migrate eligible workloads to Inference Components, serverless, or async.
Audit notebooks; implement auto-shutdown lifecycle configs.
Move training to spot where checkpointing is in place.
Evaluate Inferentia/Trainium for production inference and training.
Purchase SageMaker Savings Plans sized to baseline.
Negotiate SageMaker line items in the next EDP cycle.
Contact us for a SageMaker cost review benchmarked against 500+ engagements.

For more see the AWS AI/ML cost negotiation guide, the Bedrock pricing strategy piece for the foundation-model side, and the AI training job cost optimization piece for the training-job specifics.
