AWS AI and ML Cost Negotiation Guide: SageMaker, Bedrock, GPUs, and the EDP
AWS AI and ML workloads are the fastest-growing line on enterprise bills. They are also the most negotiable. SageMaker, Bedrock, GPU instances, and inference endpoints sit inside the EDP commitment with discount levers that compound. This pillar walks through the architecture and the contract.
AI and ML spend on AWS has gone from a budget footnote to a budget headline in roughly eighteen months. GPU-instance hours that nobody knew existed in 2022 are now top-three line items on many enterprise bills. Foundation-model inference costs that were rounding errors in 2023 are five and six figures monthly in 2026. SageMaker training jobs, Bedrock token charges, Trainium and Inferentia adoption, and the GPU shortage have combined to make AI cost negotiation the most material conversation many customers have with AWS.
The good news is that AI/ML is also the most negotiable category on AWS. The combination of GPU supply constraints, AWS strategic pressure on Bedrock adoption, and the EDP discount structure creates negotiation leverage that simply does not exist in commodity compute. This pillar walks through the cost surface, the discount architecture, and the negotiation moves that consistently take enterprise AI/ML bills down by 30 to 50 percent.
The AI/ML cost surface on AWS
The bill decomposes into roughly six categories, in rough order of typical spend:
- GPU and accelerator compute — p5, p4d, p4de, g6, g5, Inferentia, Trainium instance-hours.
- SageMaker — training, notebook, inference endpoint, model registry, and feature-store charges.
- Bedrock — foundation-model token, image, and customisation pricing.
- Storage — model artefacts, training datasets, vector databases, feature stores.
- Data transfer — pulling training data into training clusters, model deployment across regions.
- Supporting services — Athena, Glue, Redshift, Lake Formation for the data side of ML workflows.
The structural negotiation: EDP
The single most consequential negotiation lever is the Enterprise Discount Programme (EDP). AI/ML spend is bundled inside the EDP commitment alongside compute and storage. The implications:
- AI/ML growth supports a larger EDP commitment, which produces a deeper blended discount on the entire AWS estate.
- Specific AI/ML line items can be discounted aggressively inside the EDP without affecting other line items' visible rates.
- Migration credits and Bedrock adoption credits can be folded into the EDP commitment.
Typical discount range for AI/ML-heavy EDP renewals: 20 to 40 percent off list, depending on commitment size and term. At the top end, customers with $5M+ annual AI spend have negotiated as much as 45 percent.
GPU instance economics
GPU instances are the largest single AI/ML cost driver for training-heavy workloads. The hierarchy:
| Instance | GPUs | On-demand | 3-yr RI | Spot |
|---|---|---|---|---|
| p5.48xlarge | 8 x H100 | ~$98/hr | ~$48/hr | varies |
| p4d.24xlarge | 8 x A100 | ~$33/hr | ~$16/hr | varies |
| p4de.24xlarge | 8 x A100 80GB | ~$41/hr | ~$20/hr | varies |
| g6.48xlarge | 8 x L4 | ~$15/hr | ~$8/hr | varies |
| g5.48xlarge | 8 x A10G | ~$16/hr | ~$8/hr | varies |
| trn1.32xlarge | 16 x Trainium | ~$22/hr | ~$11/hr | varies |
| inf2.48xlarge | 12 x Inferentia2 | ~$14/hr | ~$7/hr | varies |
The negotiation moves on GPUs:
- Capacity Block reservations (EC2 Capacity Blocks for ML) for p4d and p5, sold as fixed-term blocks of 1 to 14 days. Useful for training campaigns: guaranteed access, typically at better pricing than on-demand.
- Reserved Instances for sustained inference workloads. 3-year RIs cut H100 cost roughly in half.
- Savings Plans (compute Savings Plans cover GPU instances) for flexibility across instance families.
- Trainium and Inferentia migration credits. AWS funds proof-of-concept work to move workloads off NVIDIA hardware.
- Capacity guarantees embedded in the EDP for strategic customers facing GPU supply constraints.
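One way to sanity-check the RI move is the break-even utilisation: the share of hours an instance has to be busy before an always-billed reservation beats paying on-demand only when needed. The sketch below uses the approximate rates from the table above; the function names are illustrative, and it ignores upfront-payment options and spot.

```python
HOURS_PER_YEAR = 8760

# Approximate hourly rates (USD) copied from the table above.
RATES = {
    "p5.48xlarge": {"on_demand": 98.0, "ri_3yr": 48.0},
    "p4d.24xlarge": {"on_demand": 33.0, "ri_3yr": 16.0},
    "inf2.48xlarge": {"on_demand": 14.0, "ri_3yr": 7.0},
}


def break_even_utilisation(on_demand: float, reserved: float) -> float:
    """Fraction of hours an instance must run for the RI to beat on-demand.

    A reservation is billed for every hour whether used or not, so it wins
    once utilisation * on_demand exceeds the reserved hourly rate.
    """
    return reserved / on_demand


def annual_cost(rate_per_hour: float, utilisation: float = 1.0) -> float:
    """Annual cost of one instance at a given hourly rate and utilisation."""
    return rate_per_hour * HOURS_PER_YEAR * utilisation


if __name__ == "__main__":
    for name, rate in RATES.items():
        threshold = break_even_utilisation(rate["on_demand"], rate["ri_3yr"])
        print(f"{name}: 3-year RI wins above {threshold:.0%} utilisation")
```

At these rates the threshold sits near 50 percent, which is why RIs suit sustained inference and Capacity Blocks or spot suit bursty training.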
Trainium and Inferentia: the price/performance lever
Trainium (training) and Inferentia (inference) are AWS-designed accelerators. The price/performance versus NVIDIA hardware is genuinely strong for many workloads:
- For transformer training, Trainium can deliver roughly 40 percent better price/performance than comparable NVIDIA-based instances.
- Inferentia2 offers significantly better price/performance for inference workloads compatible with the Neuron SDK.
- Migration cost is non-trivial: Neuron SDK compilation, model conversion, and framework adjustments.
The negotiation move: AWS will fund the migration POC. Treat Trainium/Inferentia adoption as an AWS-supported initiative with credits attached, not a cost-reduction exercise the customer pays for.
SageMaker pricing dimensions
SageMaker is not a single price; it is a family of pricing dimensions wrapped around the underlying ML instance types. The lines:
- SageMaker Studio notebooks bill per instance-hour while running.
- Training jobs bill per instance-hour for the duration of the job. Managed Spot Training can cut that cost by up to 90 percent.
- Inference endpoints bill per instance-hour while deployed. Serverless inference bills for the compute time and data processed per request, not for idle time.
- Batch transform bills per instance-hour for batch inference.
- Feature store bills storage and online read/write requests.
- Model registry is free; underlying S3 storage bills.
- Canvas bills separately; Studio Lab is a free service.
The single largest SageMaker cost reduction lever is shutting down idle notebook instances and right-sizing inference endpoints. See the SageMaker pricing optimization piece for the full breakdown.
Bedrock pricing dimensions
Bedrock is the foundation-model-as-a-service line. The model:
- On-demand token pricing varies by model. Claude, Llama, Titan, and Mistral all have different rates.
- Provisioned Throughput for guaranteed capacity, billed per model-unit-hour. Required for production at scale.
- Customisation (fine-tuning, continued pre-training) bills per training-token plus storage.
- Image generation billed per image with separate rates by model.
The negotiation moves:
- Provisioned Throughput discounts at 1- and 6-month commitment terms. 6-month commits routinely run 30 to 50 percent below on-demand.
- Bundle Bedrock into the EDP analytics/AI commitment for a blended discount.
- Model-specific pricing concessions. AWS will negotiate per-model rates with customers running material volume on Claude or Llama.
- Free tokens for early evaluation; common AWS incentive for adoption.
See the Bedrock pricing strategy piece for the model-by-model breakdown.
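The on-demand versus Provisioned Throughput decision comes down to a monthly comparison. The sketch below is illustrative only: the token prices and the model-unit rate are placeholders rather than published Bedrock rates, the function names are hypothetical, and it assumes the provisioned model units can actually carry the workload's throughput.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def on_demand_monthly(input_tokens: int, output_tokens: int,
                      in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Monthly on-demand cost from token volumes and per-1K-token prices."""
    return (input_tokens / 1000 * in_price_per_1k
            + output_tokens / 1000 * out_price_per_1k)


def provisioned_monthly(model_units: int, price_per_unit_hour: float) -> float:
    """Monthly Provisioned Throughput cost, billed every hour whether used or not."""
    return model_units * price_per_unit_hour * HOURS_PER_MONTH


def cheaper_option(on_demand: float, provisioned: float) -> str:
    """Name the cheaper pricing mode for the month."""
    return "provisioned" if provisioned < on_demand else "on_demand"


if __name__ == "__main__":
    # Placeholder volumes and prices, for illustration only.
    od = on_demand_monthly(4_000_000_000, 1_000_000_000, 0.003, 0.015)
    pt = provisioned_monthly(model_units=1, price_per_unit_hour=20.0)
    print(f"on-demand ${od:,.0f}/mo, provisioned ${pt:,.0f}/mo -> {cheaper_option(od, pt)}")
```

Running the comparison with real invoice volumes before the commitment conversation shows how large a Provisioned Throughput discount is needed to make the commit worthwhile.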
The three-layer model
Think of AI/ML cost as three architectural layers, each with its own negotiation lever:
- Infrastructure layer. GPU instances, Trainium/Inferentia, storage. Discounted via Savings Plans, RIs, EDP commit.
- Platform layer. SageMaker, Bedrock managed services. Discounted via EDP, Provisioned Throughput, Inference Components.
- Application layer. Models, prompts, retrieval augmentation. Optimised through prompt engineering, model selection, caching.
The single largest mistake in AI/ML cost programmes is optimising at the wrong layer. A team that aggressively right-sizes inference endpoints (platform layer) but never negotiates the EDP discount, the commercial lever that cuts across the infrastructure and platform layers, is leaving the larger savings on the table.
Application-layer optimisation
The application-layer levers compound and are often the highest ROI:
- Model selection by task. Use smaller, cheaper models for simple tasks; route to flagship models only when accuracy requires it.
- Prompt caching. Anthropic prompt caching (supported on Bedrock) cuts the cost of repeated context by up to ~90 percent on the cached input tokens.
- Response caching. Cache identical-prompt responses for FAQ and template-style workloads.
- Retrieval-augmented generation (RAG). Smaller models with good retrieval often outperform larger models without it, at a fraction of the cost.
- Token budget per request. Cap max_tokens aggressively; many prompts produce shorter responses than the cap allows.
- Batch processing. Bedrock batch inference bills 50 percent of on-demand rates for asynchronous workloads.
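The two levers with explicit percentages, prompt caching and batch inference, are easy to model. The sketch below assumes cached input tokens are billed at a 90 percent discount and batch jobs at 50 percent of on-demand, as stated above. It ignores any premium on cache writes, and the per-token price is a placeholder.

```python
def input_cost(tokens: int, price_per_1k: float,
               cached_fraction: float = 0.0,
               cache_discount: float = 0.90) -> float:
    """Input-token cost when a fraction of tokens is read from the prompt cache.

    Cached tokens are billed at (1 - cache_discount) of the normal rate;
    cache-write surcharges are deliberately left out of this sketch.
    """
    cached = tokens * cached_fraction
    fresh = tokens - cached
    billable = fresh + cached * (1 - cache_discount)
    return billable / 1000 * price_per_1k


def batch_cost(on_demand_cost: float, batch_multiplier: float = 0.50) -> float:
    """Cost of the same workload run through batch inference."""
    return on_demand_cost * batch_multiplier


if __name__ == "__main__":
    # Placeholder price: $0.003 per 1K input tokens.
    baseline = input_cost(1_000_000, 0.003)
    with_cache = input_cost(1_000_000, 0.003, cached_fraction=0.8)
    print(f"no cache ${baseline:.2f}, 80% cached ${with_cache:.2f}, "
          f"batch ${batch_cost(baseline):.2f}")
```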
Worked example: enterprise AI rollout
| Stage | Action | Annual run-rate |
|---|---|---|
| Baseline | On-demand GPU, on-demand Bedrock, no EDP | $8.4M |
| Step 1 | 3-year RIs on inference GPU baseline | $6.2M |
| Step 2 | Trainium adoption for 60% of training | $5.1M |
| Step 3 | Bedrock Provisioned Throughput, 6-month commit | $4.4M |
| Step 4 | EDP renegotiation with AI commitment | $3.2M |
| Step 5 | Application-layer optimisation (caching, model routing) | $2.5M |
A 70 percent reduction is achievable on AI/ML estates where the original bill was naive on-demand. Each step is independently safe and ordered for risk-adjusted ROI.
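The arithmetic behind the table can be reproduced in a few lines. The run-rates are the figures from the table; the helper name is illustrative.

```python
# Annual run-rate after each stage, in $M, copied from the table above.
STAGES = [
    ("Baseline", 8.4),
    ("Step 1: 3-year RIs on inference GPU baseline", 6.2),
    ("Step 2: Trainium adoption for 60% of training", 5.1),
    ("Step 3: Bedrock Provisioned Throughput, 6-month commit", 4.4),
    ("Step 4: EDP renegotiation with AI commitment", 3.2),
    ("Step 5: Application-layer optimisation", 2.5),
]


def stepwise_savings(stages):
    """Return (name, saving_this_step, cumulative_reduction_vs_baseline) per step."""
    baseline = stages[0][1]
    rows = []
    for (_, previous), (name, current) in zip(stages, stages[1:]):
        rows.append((name, previous - current, (baseline - current) / baseline))
    return rows


if __name__ == "__main__":
    for name, saving, cumulative in stepwise_savings(STAGES):
        print(f"{name}: saves ${saving:.1f}M, cumulative {cumulative:.0%}")
```

The largest single step in this example is the first one ($2.2M), followed by the EDP renegotiation ($1.2M).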
The negotiation calendar
Three windows compound to produce maximum leverage:
- EDP renewal. Whether at month 24 of a 3-year EDP or month 11 of a 1-year EDP, the renewal is the single largest negotiation moment. AI/ML commitment growth supports a larger discount.
- Provisioned Throughput commitment. 6-month Bedrock PT commitments can be timed to coincide with the EDP cycle.
- Capacity Block / RI purchase. GPU capacity decisions made during the EDP cycle are more negotiable.
The competitive lever
AWS faces real competitive pressure on AI/ML from Azure (OpenAI partnership) and Google Cloud (Gemini, TPU). The negotiation move:
- Run a parallel Azure or GCP RFP for material AI/ML workloads.
- Surface competitive bids to AWS during EDP negotiation.
- Frame Trainium/Inferentia adoption as evidence of AWS strategic commitment.
- Use multi-cloud architectures (training on AWS, inference on GCP, or vice versa) as a negotiation reference even if not the final plan.
See the AWS vs Azure and AWS vs GCP comparisons for the data to support these conversations.
Capacity guarantees
For customers with material AI/ML commitments, AWS will negotiate explicit capacity guarantees:
- Reserved GPU capacity in named availability zones.
- Priority access to new GPU generations at GA.
- Named architects from the AWS Generative AI Innovation Center embedded in the customer's deployment.
- Migration support credits when moving from competitor AI platforms.
These are not on the public price sheet. They appear inside EDP private pricing addenda.
Common failure modes
Over-provisioning inference endpoints
The single most expensive SageMaker anti-pattern. An inference endpoint sized at 8 instances of g5.12xlarge when 2 would suffice burns $50K+ monthly. Use SageMaker Inference Components and serverless inference for low-utilisation endpoints.
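A rough sizing check makes the over-provisioning visible. The sketch below derives an instance count from peak traffic and a target utilisation, then prices the excess. The throughput figure, the 70 percent target, and the hourly rate are assumptions to replace with load-test results and the actual instance price.

```python
import math

HOURS_PER_MONTH = 730


def required_instances(peak_rps: float, rps_per_instance: float,
                       target_utilisation: float = 0.7,
                       min_instances: int = 1) -> int:
    """Instances needed to serve peak traffic while staying under the target utilisation."""
    usable_capacity = rps_per_instance * target_utilisation
    return max(min_instances, math.ceil(peak_rps / usable_capacity))


def monthly_waste(current_instances: int, needed_instances: int,
                  hourly_rate: float) -> float:
    """Monthly cost of instances deployed beyond what the sizing check requires."""
    excess = max(0, current_instances - needed_instances)
    return excess * hourly_rate * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical endpoint: 100 requests/s at peak, 80 requests/s per instance.
    needed = required_instances(peak_rps=100, rps_per_instance=80)
    print(f"needed: {needed}, monthly waste at 8 deployed: "
          f"${monthly_waste(8, needed, hourly_rate=10.0):,.0f}")
```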
Keeping notebook instances running
SageMaker Studio notebooks bill while running. Auto-shutdown lifecycle configurations are essential.
On-demand Bedrock for production
Bedrock on-demand is for development and evaluation. Production-scale workloads left on on-demand can pay up to double what Provisioned Throughput would cost.
Single-model architecture
Routing every request to the most powerful model is the most expensive default. Use a model-routing layer that sends simple tasks to small models and reserves flagship models for tasks that require them.
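A routing layer can start as a simple heuristic in front of the model call. The sketch below is a deliberately crude illustration: the tier names, prices, keyword list, and word-count threshold are all hypothetical, and a production router would more likely use a classifier or evaluation-driven rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    price_per_1k_tokens: float  # placeholder price, not a published rate


SMALL = ModelTier("small-model", 0.0005)
FLAGSHIP = ModelTier("flagship-model", 0.015)

# Hypothetical markers of requests that need the stronger model.
COMPLEX_HINTS = ("analyse", "analyze", "step by step", "legal", "prove")


def route(prompt: str, max_simple_words: int = 200) -> ModelTier:
    """Send long or evidently complex prompts to the flagship tier, the rest to the small tier."""
    text = prompt.lower()
    is_long = len(text.split()) > max_simple_words
    looks_complex = any(hint in text for hint in COMPLEX_HINTS)
    return FLAGSHIP if (is_long or looks_complex) else SMALL


if __name__ == "__main__":
    for prompt in ("What are your opening hours?",
                   "Analyse this contract for legal risk"):
        print(f"{prompt!r} -> {route(prompt).name}")
```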
Ignoring the EDP for AI
Treating AI/ML as a separate budget line that the FinOps team manages independently from the EDP. AI/ML spend supports a deeper EDP discount; the FinOps and procurement conversations need to be joined.
Storage and data infrastructure
AI/ML workloads have storage and data-transfer dynamics that the standard FinOps playbook misses:
- Training data lakes are often poorly served by default S3 lifecycle policies; consider S3 Intelligent-Tiering, or Glacier Deep Archive for cold training data.
- Vector databases (OpenSearch, Aurora pgvector, Pinecone) have their own cost surface separate from compute.
- Cross-region data transfer for training can dwarf the compute cost. Co-locate training and data in the same region.
- Model artefacts in S3 grow quickly; implement lifecycle policies on model versions.
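The two S3 items above can be expressed as a single lifecycle configuration. The sketch below builds the dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the prefixes, day counts, and bucket name are assumptions, and the noncurrent-version rule only has an effect on a versioned bucket.

```python
def ml_bucket_lifecycle(cold_data_prefix: str = "training-data/cold/",
                        model_prefix: str = "models/",
                        archive_after_days: int = 90,
                        keep_noncurrent_versions: int = 5,
                        expire_noncurrent_after_days: int = 180) -> dict:
    """Build an S3 lifecycle configuration for cold training data and old model versions."""
    return {
        "Rules": [
            {
                # Move cold training data to Glacier Deep Archive.
                "ID": "archive-cold-training-data",
                "Filter": {"Prefix": cold_data_prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "DEEP_ARCHIVE"}
                ],
            },
            {
                # Keep only the most recent noncurrent model versions.
                "ID": "prune-old-model-versions",
                "Filter": {"Prefix": model_prefix},
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": expire_noncurrent_after_days,
                    "NewerNoncurrentVersions": keep_noncurrent_versions,
                },
            },
        ]
    }


# To apply (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-ml-bucket",
#       LifecycleConfiguration=ml_bucket_lifecycle(),
#   )
```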
Multi-account architecture for AI
Mature AI/ML estates typically run across multiple accounts:
- Dedicated training account with GPU capacity reservations.
- Per-product inference accounts for blast-radius isolation.
- Central data and feature-store account.
- Sandbox account with strict spending limits.
The cost implication: Reserved Instances must be planned across accounts via Consolidated Billing for the discount to flow. See the AWS Organizations billing strategy piece for the mechanics.
Implementation checklist
- Inventory AI/ML spend by service, account, and team over the past 90 days.
- Identify the top three cost lines: typically GPU instances, SageMaker, Bedrock.
- Audit GPU instance utilisation; right-size and consider Capacity Block Reservations.
- Audit SageMaker endpoints and notebooks; shut down idle resources, right-size active ones.
- Evaluate Trainium and Inferentia migration for high-volume workloads.
- Move sustained Bedrock workloads to Provisioned Throughput.
- Implement application-layer optimisations: model routing, prompt caching, response caching.
- Build the EDP renewal case around AI/ML growth.
- Run parallel competitive bids on Azure/GCP for negotiation leverage.
- Contact us for an AI/ML cost review benchmarked against 500+ engagements.
The procurement conversation
The CFO/procurement framing that consistently produces the deepest discounts:
- Present AI/ML as a strategic AWS investment, not just a budget line.
- Quantify the multi-year commitment growth supportable from current AI/ML trajectory.
- Frame Trainium/Inferentia adoption as evidence of strategic alignment.
- Reference competitor pricing without committing to multi-cloud architecture.
- Negotiate the EDP renewal as a partnership uplift, not a cost-cutting exercise.
How AWSNegotiations approaches AI/ML
Across 500+ engagements and $2.4B+ AWS spend reviewed, the consistent pattern in AI/ML negotiation:
- Surface every line: GPU, SageMaker, Bedrock, supporting services, embedded in EDP.
- Build the discount stack: SP/RI + EDP + service-specific concessions + capacity guarantees.
- Time the negotiation to GPU supply pressure and AWS strategic priorities.
- Document the savings: $340M+ delivered to clients, 38 percent average reduction across AI/ML lines.
RAG architecture and vector database cost
Retrieval-augmented generation has become the default architecture for enterprise GenAI applications. The cost surface beyond the foundation-model token charge:
- Embedding generation. Each chunk indexed is an embedding call. For a large corpus this is a one-time cost; for evolving documents it is ongoing.
- Vector database. Options on AWS include OpenSearch with k-NN, Aurora with pgvector, and third-party services via Marketplace. Each has different scaling economics.
- Retrieval calls. Each user query becomes one embedding call plus one vector search plus one foundation-model call.
- Reranking. Cross-encoder rerankers add another model call per query but typically improve accuracy enough to allow using a smaller generation model.
The cost-optimal pattern: cache embeddings aggressively, batch embed when possible, choose the smallest embedding model that meets retrieval quality, and use OpenSearch k-NN at scale rather than pgvector for very large corpora.
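The per-query cost described above (one embedding call, one vector search, an optional rerank, one generation call) can be written as a small function. All prices in the sketch are placeholders, and the vector search is treated as a flat per-query figure even though in practice it is usually an amortised share of cluster cost.

```python
def rag_query_cost(query_tokens: int, context_tokens: int, output_tokens: int,
                   embed_price_per_1k: float,
                   gen_in_price_per_1k: float, gen_out_price_per_1k: float,
                   search_cost: float = 0.0, rerank_cost: float = 0.0) -> float:
    """Cost of one RAG query: embedding + search + optional rerank + generation."""
    embedding = query_tokens / 1000 * embed_price_per_1k
    # The generation prompt carries the query plus the retrieved context.
    generation_in = (query_tokens + context_tokens) / 1000 * gen_in_price_per_1k
    generation_out = output_tokens / 1000 * gen_out_price_per_1k
    return embedding + search_cost + rerank_cost + generation_in + generation_out


if __name__ == "__main__":
    # Placeholder prices, for illustration only.
    cost = rag_query_cost(query_tokens=50, context_tokens=1950, output_tokens=300,
                          embed_price_per_1k=0.0001,
                          gen_in_price_per_1k=0.003, gen_out_price_per_1k=0.015)
    print(f"cost per query: ${cost:.6f}")
```

With figures like these the retrieved context dominates the bill, so trimming the number or size of retrieved chunks is often worth more than switching embedding models.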
Fine-tuning vs prompting vs RAG
The three approaches to customising foundation-model behaviour have very different cost profiles:
| Approach | Upfront cost | Per-query cost | Best for |
|---|---|---|---|
| Prompting | $0 | Token charges only | General-purpose, evolving tasks |
| RAG | Indexing + vector DB | Embedding + retrieval + generation | Knowledge-intensive applications |
| Fine-tuning | Training tokens + storage | Lower per-query (smaller model possible) | Stable, narrow tasks at high volume |
Fine-tuning is rarely the right first step. The default progression: prompt engineering, then RAG, then fine-tuning only if RAG is insufficient and volume justifies the upfront cost.
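Whether volume justifies the upfront cost is a break-even calculation. The sketch below assumes a fixed fine-tuning cost and a constant per-query saving from serving a smaller tuned model; the example figures are hypothetical, and ongoing hosting or storage charges for the tuned model are left out.

```python
import math
from typing import Optional


def break_even_queries(upfront_cost: float,
                       baseline_cost_per_query: float,
                       tuned_cost_per_query: float) -> Optional[int]:
    """Queries needed before fine-tuning pays back, or None if it never does."""
    saving_per_query = baseline_cost_per_query - tuned_cost_per_query
    if saving_per_query <= 0:
        return None  # the tuned model is not cheaper per query
    return math.ceil(upfront_cost / saving_per_query)


if __name__ == "__main__":
    # Hypothetical: $50K to fine-tune, $0.02 per query before, $0.005 after.
    print(break_even_queries(50_000, 0.02, 0.005))
```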
Model evaluation and observability
Without evaluation infrastructure, optimisation decisions are guesswork. The minimum:
- Track per-request cost across all models in use, attributed to the application and team consuming.
- Measure quality on a held-out evaluation set whenever model selection or prompt changes.
- Alert on cost-per-request regressions, not just absolute spend.
- Use Bedrock model invocation logging to capture prompt and response payloads for analysis.
The pattern that produces the largest optimisation wins: a quarterly review where each AI workload's model selection is revisited based on the latest model launches and price changes. Models that were the cost-optimal choice in 2024 are not the cost-optimal choice in 2026.
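Alerting on cost-per-request regressions needs only a baseline and a tolerance. The sketch below compares two snapshots keyed by workload name; the 10 percent tolerance and the workload names are assumptions, and the snapshots would come from whatever cost-attribution data is already collected.

```python
def cost_per_request(total_cost: float, request_count: int) -> float:
    """Average cost of one request over a reporting period."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    return total_cost / request_count


def regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> dict:
    """Return workloads whose cost per request rose by more than the tolerance.

    Both arguments map workload name -> cost per request. The result maps
    each flagged workload to its relative increase.
    """
    flagged = {}
    for workload, base in baseline.items():
        now = current.get(workload)
        if now is not None and base > 0 and now > base * (1 + tolerance):
            flagged[workload] = (now - base) / base
    return flagged


if __name__ == "__main__":
    before = {"search": 0.010, "chat": 0.020}
    after = {"search": 0.013, "chat": 0.021}
    for workload, increase in regressions(before, after).items():
        print(f"{workload}: cost per request up {increase:.0%}")
```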
Edge inference and on-device tradeoffs
For very high-volume inference workloads (millions of requests per day), edge or on-device inference can be dramatically cheaper than cloud inference. AWS IoT Greengrass supports deployment of inference models to edge devices; SageMaker Edge Manager, previously an option here, was discontinued in April 2024. The cost equation:
- Cloud inference scales linearly with request volume.
- Edge inference is a one-time deployment cost plus device management overhead.
- The crossover depends on per-request cloud cost and device count.
For consumer-facing applications with millions of users, edge inference on a small model often outperforms cloud inference on a larger model both in cost and latency.
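The crossover point follows directly from the cost equation above. The sketch below treats edge cost as fixed over the period (deployment plus per-device management) and cloud cost as linear in request volume; the example figures are hypothetical.

```python
import math


def crossover_requests(deployment_cost: float, device_count: int,
                       mgmt_cost_per_device: float,
                       cloud_cost_per_request: float) -> int:
    """Request volume over the period above which edge inference is cheaper than cloud."""
    edge_total = deployment_cost + device_count * mgmt_cost_per_device
    return math.ceil(edge_total / cloud_cost_per_request)


if __name__ == "__main__":
    # Hypothetical: $200K deployment, 10,000 devices at $5 each to manage,
    # against $0.003 per cloud request.
    print(crossover_requests(200_000, 10_000, 5.0, 0.003))
```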
Budgets, guardrails, and chargeback
AI/ML spend grows experimentally. Without structural guardrails, a single data-science team can burn six figures in a week experimenting with foundation-model variants. The minimum guardrails:
- Per-team or per-account spend limits with AWS Budgets.
- Per-API-key rate limits on Bedrock invocations.
- SageMaker endpoint quotas at the account level.
- Tag-based chargeback so each business unit sees its actual AI/ML cost.
- A formal "production deployment" gate where new GenAI workloads must demonstrate model selection and prompt optimisation before scaling.
The chargeback step is the highest-leverage governance move. Teams that see their own AI/ML bill optimise their own spend; teams shielded from cost signal do not.
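Tag-based chargeback is, at its core, a grouped sum over billing line items. The sketch below assumes each line item has already been exported as a dictionary with a `cost` and a `tags` mapping (a hypothetical shape, not the Cost and Usage Report schema). Untagged spend is reported under its own heading rather than dropped.

```python
from collections import defaultdict


def chargeback(line_items, tag_key: str = "team") -> dict:
    """Sum cost per value of the given tag; untagged spend is reported separately."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    items = [
        {"service": "Bedrock", "cost": 100.0, "tags": {"team": "search"}},
        {"service": "SageMaker", "cost": 50.0, "tags": {"team": "search"}},
        {"service": "EC2", "cost": 20.0, "tags": {}},
    ]
    for team, total in sorted(chargeback(items).items()):
        print(f"{team}: ${total:,.2f}")
```

Keeping the untagged bucket visible gives teams a reason to fix tagging gaps, which in turn improves the chargeback numbers.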
The negotiation timeline
For organisations starting an AI/ML cost programme, the sequence that consistently produces the best results:
- Month 1: Inventory and chargeback. Get the data right before negotiating.
- Month 2: Operational right-sizing. Idle endpoints, oversized notebooks, on-demand workloads moved to spot or Savings Plans.
- Month 3: Application-layer optimisation. Model routing, caching, prompt engineering.
- Month 4-5: Trainium/Inferentia evaluation. POCs funded by AWS where available.
- Month 6: EDP negotiation with the cost trajectory and the strategic AI commitment story.
- Ongoing: Quarterly review of model selection, vendor mix, and contract terms.
For more, see the SageMaker pricing optimization piece, the Bedrock pricing strategy piece, and the AI training job cost optimization piece for the operational levers. For the contractual side, the EDP negotiation guide is the foundation. For multi-cloud leverage, see the AWS vs Azure and AWS vs GCP comparisons.