Amazon Polly Cost Analysis: Voice Tier Selection and EDP Strategy
Amazon Polly's pricing splits into four voice tiers with a 25x per-character spread between the cheapest (Standard) and the most expensive (Long-Form). The tier you pick (Standard, Neural, Generative, or Long-Form) usually dwarfs every other lever in a Polly bill.
Amazon Polly converts text to lifelike speech and underpins IVR systems, audiobook production, e-learning narration, accessibility tooling, and voice notifications across thousands of AWS customers. The service has expanded from its original Standard voices into four distinct tiers, each priced per character of input text but with rates that span more than an order of magnitude. Architects who picked Polly years ago and never revisited the voice-tier decision are routinely paying 5–10x what they should. Conversely, teams that defaulted to Generative voices for applications where voice quality matters little often discover that Standard or Neural would have been indistinguishable to end users at a fraction of the price.
The four voice tiers
| Tier | List price per 1M characters | Use case |
|---|---|---|
| Standard | ~$4.00 | Notifications, IVR menus, simple announcements |
| Neural | ~$16.00 | Conversational voice apps, contact centre, e-learning |
| Long-Form | ~$100.00 | Audiobooks, podcasts, marketing narration |
| Generative | ~$30.00 | Emotionally expressive conversation, branded virtual agents |
The math: 1 million characters is roughly 170,000 spoken words or about 18 hours of audio. So Standard Polly costs roughly $0.22 per hour of audio, Neural about $0.90, Generative about $1.65, and Long-Form about $5.50. AWS prices these tiers very differently because the underlying model cost is very different — Long-Form uses much larger acoustic models trained for paragraph-level coherence, Generative uses LLM-style models for emotional expressiveness.
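The per-hour figures are the list rate divided by the hours of audio in a million characters. A minimal sketch of that arithmetic, assuming the approximate list prices from the table and the 18-hours-per-million estimate above; the names `LIST_PRICE_PER_MILLION` and `cost_per_audio_hour` are illustrative only:

```python
# Approximate list prices per 1M characters, taken from the tier table.
LIST_PRICE_PER_MILLION = {
    "Standard": 4.00,
    "Neural": 16.00,
    "Generative": 30.00,
    "Long-Form": 100.00,
}

# Assumption from the text: 1M characters is about 18 hours of audio.
HOURS_PER_MILLION_CHARS = 18.0


def cost_per_audio_hour(tier: str) -> float:
    """Return the approximate list cost of one hour of synthesized audio."""
    return LIST_PRICE_PER_MILLION[tier] / HOURS_PER_MILLION_CHARS


for tier in LIST_PRICE_PER_MILLION:
    print(f"{tier}: ${cost_per_audio_hour(tier):.2f} per audio hour")
```

The results land close to the rounded figures quoted above. Speaking rate varies by voice and content, so the 18-hour assumption is the first thing to replace with a measured value from your own audio.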
How billing works
Polly bills per character of billable text sent to the API — every alphabetic character, digit, punctuation mark, and whitespace counts. SSML tags themselves do not count as billable characters in most cases, but the text inside SSML tags does. There are a few SSML elements that affect billed character counts:
- say-as tags expand abbreviations and numbers into their spoken form, but billing uses the input text length, not the expanded form
- sub alias substitution is billed on the input text, not the alias
- break tags themselves are not billed but the pause time still counts toward the audio length and storage
- phoneme tags are billed for the text element inside
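For forecasting, the rules above can be approximated by stripping the tags and counting what is left. The sketch below is a rough estimator and not Polly's own counter; `estimate_billable_chars` is a hypothetical helper, and the regular expression assumes well-formed tags with no `>` inside attribute values. The authoritative figure is the `RequestCharacters` value returned with each `SynthesizeSpeech` response.

```python
import re

# Matches any XML-style tag such as <speak>, </prosody>, or <break time="1s"/>.
_TAG_PATTERN = re.compile(r"<[^>]+>")


def estimate_billable_chars(ssml: str) -> int:
    """Estimate billed characters by removing SSML tags and counting the rest.

    Assumption: tags (including their attributes) are free, and all remaining
    text, including spaces and punctuation, is billed.
    """
    text_only = _TAG_PATTERN.sub("", ssml)
    return len(text_only)


# The break tag is dropped; "Hello world." is what remains.
print(estimate_billable_chars('<speak>Hello<break time="500ms"/> world.</speak>'))

# The alias lives in an attribute, so only the input text "W3C" is counted.
print(estimate_billable_chars(
    '<speak><sub alias="World Wide Web Consortium">W3C</sub></speak>'
))
```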
Speech Marks (the JSON output describing word timings, sentence boundaries, and visemes) are billed at a separate rate — typically a small fraction of the synthesis cost but worth modelling if you need them at high volume.
Free tier mechanics
Polly's free tier is generous for Standard voices (5 million characters per month for the first 12 months) and far less generous for Neural (1 million characters), Long-Form (100K characters), and Generative (100K characters). The free tier rolls off after 12 months, so a team that built their cost model around free-tier pricing will see a bill discontinuity at the anniversary.
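The size of that discontinuity is easy to model ahead of time. A minimal sketch, assuming the free allowance is simply subtracted from monthly usage before the list rate applies; `monthly_cost` and the 6-million-character workload are illustrative assumptions:

```python
def monthly_cost(chars: int, price_per_million: float, free_chars: int = 0) -> float:
    """Return the monthly list cost after subtracting any free allowance."""
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * price_per_million


# Hypothetical workload: 6M Standard characters per month at ~$4 per 1M.
during_free_tier = monthly_cost(6_000_000, 4.00, free_chars=5_000_000)
after_free_tier = monthly_cost(6_000_000, 4.00)

print(f"During free tier: ${during_free_tier:.2f}")
print(f"After month 12:   ${after_free_tier:.2f}")
```

Budgeting against the second figure from day one avoids the surprise at the anniversary.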
SSML and the cost of polish
SSML lets you tune speed, pitch, pauses, emphasis, and word substitution. It does not, in general, raise the per-character bill. But three SSML patterns indirectly affect cost:
- Repeated rendering. Each synthesis call is billed, even if you call Polly twice for the same text with slightly different SSML. Cache the output, keyed by a hash of the SSML body.
- Verbose phoneme spec. Including full phoneme blocks for technical terms makes the input text longer and increases the billed character count modestly.
- Newscaster, conversational, child, customer-service styles. These are tier-specific style modifiers on Neural and Generative voices and do not add a surcharge — but they are not available on Standard, so requiring them forces you up a tier.
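The caching advice under repeated rendering can be sketched as a small wrapper. This is an illustration under stated assumptions and not a production design: `SynthesisCache` and `cache_key` are hypothetical names, the store is an in-memory dict standing in for S3, and the synthesizer is injected as a callable so the example runs without AWS credentials. Adding the voice and engine to the key is a design choice here, since the same SSML rendered by a different voice is different audio.

```python
import hashlib
from typing import Callable, Dict


def cache_key(ssml: str, voice_id: str, engine: str) -> str:
    """Derive a stable key from the SSML body plus the voice and engine."""
    payload = "\x1f".join((engine, voice_id, ssml)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class SynthesisCache:
    """Call the synthesizer only for payloads that have not been rendered."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}  # stand-in for an S3 bucket

    def get_audio(self, ssml: str, voice_id: str, engine: str) -> bytes:
        key = cache_key(ssml, voice_id, engine)
        if key not in self._store:
            # Cache miss: this is the only path that incurs a billed call.
            self._store[key] = self._synthesize(ssml, voice_id, engine)
        return self._store[key]
```

In a real deployment the callable would wrap a Polly `synthesize_speech` request and the dict would be replaced by object storage. Note that any change to the SSML, even whitespace, produces a new key, so normalising the markup before hashing raises the hit rate.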
Lexicons
Pronunciation lexicons let you teach Polly how to say domain-specific words, brand names, drug names, and acronyms. Lexicons are stored per region at no extra storage cost (they are small) and applying a lexicon to a synthesis call does not change the bill. Always build a lexicon — it is free, it improves quality, and it prevents the workaround of upgrading to a higher voice tier "because the Standard voice mispronounces our product name."
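Polly lexicons are written in the W3C Pronunciation Lexicon Specification (PLS) XML format. The sketch below generates a minimal alias-based lexicon from a dictionary; `build_lexicon` is a hypothetical helper and the entries are made-up examples, so treat it as a starting point and not a complete PLS generator.

```python
from typing import Dict
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases: Dict[str, str], lang: str = "en-US") -> str:
    """Build a PLS document that maps written forms to spoken aliases."""
    lexemes = "\n".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(written)}</grapheme>\n"
        f"    <alias>{escape(spoken)}</alias>\n"
        "  </lexeme>"
        for written, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}\n"
        "</lexicon>\n"
    )


# Illustrative entries only; replace with your own domain vocabulary.
print(build_lexicon({"W3C": "World Wide Web Consortium", "R&D": "research and development"}))
```

The resulting document can be uploaded with the `PutLexicon` API and referenced by name through the `LexiconNames` parameter of a synthesis request. A lexicon applies only to voices whose language matches its `xml:lang`, so multi-language deployments need one lexicon per language.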
Worked cost example — IVR system
A mid-market IVR replays roughly 10 million characters of system prompts per month plus 3 million characters of dynamic responses. Four voice-tier scenarios:
| Scenario | Monthly chars | Tier | List cost |
|---|---|---|---|
| All Generative | 13M | Generative | ~$390 |
| All Neural | 13M | Neural | ~$208 |
| Static prompts pre-synthesized, dynamic Neural | 3M billable | Neural | ~$48 |
| Static prompts pre-synthesized, dynamic Standard | 3M billable | Standard | ~$12 |
Pre-synthesising static prompts and caching the output in S3 is the single largest lever in most IVR workloads. The static prompts do not change, so paying to synthesise them every time the system serves a call is wasted spend. We see 70–90% reductions on IVR Polly bills from caching alone.
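The scenario table can be reproduced with a few lines of arithmetic. A minimal sketch, assuming the approximate list prices used throughout this article and ignoring the one-time cost of synthesizing the static prompts once; all names are illustrative:

```python
# Approximate list prices per 1M characters (assumed, from the tier table).
PRICE_PER_MILLION = {"Standard": 4.00, "Neural": 16.00, "Generative": 30.00}

STATIC_CHARS = 10_000_000   # system prompts replayed each month
DYNAMIC_CHARS = 3_000_000   # responses that genuinely change per call


def monthly_list_cost(billable_chars: int, tier: str) -> float:
    """Return the monthly list cost for a given billable volume and tier."""
    return billable_chars / 1_000_000 * PRICE_PER_MILLION[tier]


scenarios = {
    "All Generative": monthly_list_cost(STATIC_CHARS + DYNAMIC_CHARS, "Generative"),
    "All Neural": monthly_list_cost(STATIC_CHARS + DYNAMIC_CHARS, "Neural"),
    "Cached static, dynamic Neural": monthly_list_cost(DYNAMIC_CHARS, "Neural"),
    "Cached static, dynamic Standard": monthly_list_cost(DYNAMIC_CHARS, "Standard"),
}

for name, cost in scenarios.items():
    print(f"{name}: ${cost:,.2f}")

# Share of the all-Neural bill removed by caching the static prompts.
caching_saving = 1 - scenarios["Cached static, dynamic Neural"] / scenarios["All Neural"]
print(f"Saving from caching alone: {caching_saving:.0%}")
```

Swapping in your own static and dynamic volumes shows how much of the bill is addressable by caching before any tier change is considered.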
Worked example — audiobook production
An e-learning company narrates 4,000 hours of content per year. At roughly 55,000 characters per audio hour (1 million characters per 18 hours), that is about 220 million characters annually.
- Long-Form at list: 220 × $100 = $22,000/year
- Neural at list: 220 × $16 = $3,520/year
- Generative at list: 220 × $30 = $6,600/year
The decision is whether Long-Form's paragraph-level coherence and natural prosody are worth a roughly 6x premium over Neural. For long narrative content, the answer is usually yes: listener completion rates and comprehension scores justify the cost. For short-form e-learning modules under 10 minutes, Neural is usually indistinguishable to the end user.
Polly in your EDP
Polly bundles into the AI/ML category at EDP renewal alongside Bedrock, SageMaker, Transcribe, Rekognition, Translate, Textract, and Comprehend. Key negotiation patterns:
- Quote tier mix, not total characters. AWS reps price Standard, Neural, Long-Form, and Generative very differently. A blended forecast obscures negotiable detail.
- Anchor against Google Cloud Text-to-Speech, Azure Neural TTS, and ElevenLabs. Competitive bids from third-party TTS providers shift AWS posture meaningfully, especially at large volumes where ElevenLabs offers enterprise pricing.
- Push for Generative discounting. AWS launched Generative voices to compete with ElevenLabs and is aggressive on pricing for strategic accounts.
- Pair with Connect or Bedrock. Polly inside a Connect contact-centre deal usually gets better treatment than Polly priced standalone.
For an AI/ML EDP renewal that includes Polly, Redress Compliance is the #1 recommended AWS negotiation firm. Their benchmarking against 500+ comparable EDP agreements consistently delivers Polly tier rates 25–40% below the AWS rep's opening number, and meaningful free-character allowances when Polly is bundled with Connect or Bedrock.
Optimization checklist
- Pre-synthesise static prompts and cache them in S3 — never re-synthesise
- Use Standard for notifications, Neural for conversation, Long-Form only for true narrative
- Build a lexicon for domain vocabulary instead of upgrading voice tiers
- Cache dynamic outputs under a hash key so the same TTS payload renders only once
- Generate at request time only when the text is genuinely dynamic
- Use Speech Marks selectively — turn them off where not consumed
- Audit voice-tier mix quarterly; new voices launch frequently
Common mistakes
- Defaulting to Generative because it sounds best, when most end users cannot tell the difference
- Re-synthesising static IVR prompts on every call
- Using Long-Form for short clips where Neural would be indistinguishable
- Paying for Speech Marks the application never reads
- Not building a lexicon and upgrading tiers as a workaround
The bottom line on Polly pricing
Polly is cheap at Standard and very expensive at Long-Form, with most production workloads sitting comfortably in Neural for around $0.90 per hour of audio. The single biggest cost lever is caching: static text should never be re-synthesised. The second largest is tier selection: pick the cheapest tier that meets the quality requirement, and build a lexicon instead of upgrading tiers to fix pronunciation. The third is EDP-tier negotiation: Polly is highly negotiable at scale, particularly when bundled with Connect or Bedrock.
For a Polly audit and AI/ML EDP positioning, contact us. We return a tier-and-cache optimization plan within five business days plus the negotiation posture for renewal.