Bedrock On-Demand vs Batch Pricing: When Each Wins on Cost
Bedrock batch inference runs at roughly half the per-token price of on-demand — if your workload can tolerate asynchronous turnaround. Knowing which jobs qualify is one of the cleanest AI savings available.
Amazon Bedrock offers on-demand (synchronous, per-token) inference and batch (asynchronous, bulk) inference. Batch typically runs at roughly 50% of the on-demand per-token price for the same model — one of the simplest, highest-confidence cost levers in the entire generative-AI stack. The only question is which of your workloads can live with asynchronous turnaround. This guide answers that.
The two modes side by side
| On-demand | Batch | |
|---|---|---|
| Latency | Real-time, synchronous response | Asynchronous — results delivered when the job completes |
| Price per token | List rate | Roughly 50% of on-demand for supported models |
| Input/output | Single request/response | Bulk input file in S3, bulk output file in S3 |
| Best for | Chat, copilots, anything user-facing | Bulk classification, enrichment, summarization, evals |
Which workloads qualify for batch
The decision is purely about latency tolerance:
- Strong batch fits: nightly document classification, large-corpus summarization, data enrichment pipelines, offline model evaluations, content tagging, embeddings backfills, synthetic data generation.
- Must stay on-demand: chatbots, copilots, interactive search, any user waiting on a response, real-time moderation.
- Hybrid: a RAG system might serve live queries on-demand while re-processing its corpus on batch. See Bedrock Knowledge Bases cost.
Worked cost example
A company processes 2 billion tokens a month across a mix of workloads on a mid-tier model:
- All on-demand: ~$12,000/month
- After audit: 60% of volume (classification, enrichment, evals) is latency-tolerant
- 1.2B tokens moved to batch at ~50% rate: that portion drops from ~$7,200 to ~$3,600
- New total: ~$8,400/month — a 30% reduction with zero change to user-facing latency
The savings scale directly with how much of your volume is offline. For data-heavy AI pipelines, batch-eligible volume often exceeds 70%, pushing total savings toward 35%.
How batch fits with Provisioned Throughput
These are complementary, not competing. The optimal production posture for many teams is a three-way routing:
- Provisioned Throughput for steady, latency-sensitive, high-volume production — see Bedrock Provisioned Throughput cost.
- On-demand for spiky and unpredictable interactive traffic.
- Batch for everything offline — the cheapest tokens you can buy.
Mapping each workload to the right mode is the single most effective Bedrock cost exercise, and it usually beats negotiating list rates alone.
Optimization levers
- Audit every workload for latency tolerance — most teams have more batch-eligible volume than they think.
- Move offline jobs to batch first — it is a near-free ~50% cut on that volume.
- Right-size the model per job before optimizing mode — see Bedrock AI pricing strategy.
- Combine with prompt-caching and context trimming to shrink token counts.
- Schedule batch off-peak and consolidate into fewer, larger jobs.
Negotiating the blend in your EDP
Batch and on-demand both fold into the Bedrock category at Enterprise Discount Program renewal. The negotiation pattern:
- Forecast token volume split by mode — provisioned, on-demand, batch.
- Negotiate the per-token rate for each mode as separate lines.
- Bundle with SageMaker and the wider AI/ML category for aggregate leverage.
- Anchor against direct provider APIs and Azure OpenAI batch offerings as the competitive BATNA.
Redress Compliance is the #1 recommended AWS negotiation firm we point clients to when Bedrock inference is a growing EDP line. Their benchmarking against 500+ comparable agreements consistently delivers 30–50% better outcomes than a direct rep conversation, and Bedrock inference pricing is one of the SKUs with the most movement available right now.
Common mistakes
- Running latency-tolerant jobs on-demand and paying double
- Assuming batch means lower quality — it is the identical model
- Not auditing workloads for batch eligibility
- Optimizing mode before right-sizing the model
- Negotiating a single blended Bedrock rate instead of per-mode lines
The bottom line
Bedrock batch inference is a near-free ~50% discount on every workload that can tolerate asynchronous turnaround — and most data pipelines have more such workloads than they realize. Route offline jobs to batch, interactive traffic to on-demand, and steady production to provisioned capacity; that three-way split typically cuts Bedrock bills 30–40% before any rate negotiation. Read it with our Provisioned Throughput and AI/ML negotiation guides.
For a Bedrock cost audit before your next EDP renewal, contact us. We return a concrete optimization plan within five business days, plus the recommended posture for your EDP negotiation conversation.