Amazon Polly Cost Analysis: Voice Tier Selection and EDP Strategy

By Elena, Senior Advisor · Published September 11, 2025 · Last updated January 15, 2026 · 9 min read

Amazon Polly's pricing splits into four voice tiers with a 25x per-character spread between the cheapest and the most expensive. The tier you pick (Standard, Neural, Long-Form, or Generative) usually dwarfs every other lever in a Polly bill.

Published Apr 2026 · Cluster: AI/ML · 10 min read

Amazon Polly converts text to lifelike speech and underpins IVR systems, audiobook production, e-learning narration, accessibility tooling, and voice notifications across thousands of AWS customers. The service has expanded from its original Standard voices into four distinct tiers, each priced per character of input text but with rates that span more than an order of magnitude. Architects who picked Polly years ago and never revisited the voice-tier decision are routinely paying 5–10x what they should — and conversely, teams that defaulted to Generative voices for cost-insensitive applications often discover Standard or Neural would have been indistinguishable to end users at a fraction of the price.

What this covers: Polly pricing by voice tier, character billing nuances, SSML cost impact, free-tier mechanics, lexicon and speech-mark pricing, worked cost examples, and how to negotiate Polly in the AI/ML EDP category.

The four voice tiers

Tier	List price per 1M characters	Typical use case
Standard	~$4.00	Notifications, IVR menus, simple announcements
Neural	~$16.00	Conversational voice apps, contact centre, e-learning
Long-Form	~$100.00	Audiobooks, podcasts, marketing narration
Generative	~$30.00	Emotionally expressive conversation, branded virtual agents

The math: 1 million characters is roughly 170,000 spoken words, or about 18 hours of audio. So Standard Polly costs roughly $0.22 per hour of audio, Neural about $0.90, Generative about $1.67, and Long-Form about $5.56. AWS prices these tiers very differently because the underlying model cost is very different: Long-Form uses much larger acoustic models trained for paragraph-level coherence, while Generative uses LLM-style models for emotional expressiveness.
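The per-hour figures can be reproduced with a few lines of arithmetic. This is a minimal sketch that takes the approximate list prices from the tier table and the 18-hours-per-million-characters rule of thumb as given; the names `LIST_PRICE_PER_MILLION` and `cost_per_audio_hour` are illustrative, not part of any AWS API.

```python
# Approximate list prices per 1M characters, taken from the tier table.
LIST_PRICE_PER_MILLION = {
    "standard": 4.00,
    "neural": 16.00,
    "generative": 30.00,
    "long-form": 100.00,
}

# Rule of thumb from the text: 1M characters is about 18 hours of audio.
HOURS_PER_MILLION_CHARS = 18.0


def cost_per_audio_hour(tier: str) -> float:
    """Approximate list cost in USD for one hour of synthesized audio."""
    return LIST_PRICE_PER_MILLION[tier] / HOURS_PER_MILLION_CHARS


for tier in LIST_PRICE_PER_MILLION:
    print(f"{tier:>10}: ${cost_per_audio_hour(tier):.2f} per hour")
```

Changing `HOURS_PER_MILLION_CHARS` to match your own measured speaking rate rescales every tier at once, which is usually more reliable than a generic rule of thumb.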

How billing works

Polly bills per character of billable text sent to the API: every alphabetic character, digit, punctuation mark, and whitespace character counts. SSML tags themselves do not count as billable characters in most cases, but the text inside them does. A few SSML elements deserve a closer look:

say-as tags expand abbreviations and numbers into their spoken form, but billing uses the length of the input text, not the expanded form
sub alias substitution is billed on the input text, not the alias
break tags are not billed, but the pause still adds to audio length and storage
phoneme tags are billed for the text inside the element
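To budget before sending anything to the API, the rules above can be approximated by parsing the SSML and counting only the text between tags. The sketch below is an estimate under that assumption, not AWS's actual metering logic; `estimate_billable_chars` is a hypothetical helper, and it expects well-formed SSML (Amazon-specific tags with the `amazon:` prefix would need a namespace declaration or a more lenient parser).

```python
import xml.etree.ElementTree as ET


def estimate_billable_chars(ssml: str) -> int:
    """Estimate billable characters by counting text outside SSML markup.

    Tag names and attribute values (such as a sub alias) are ignored;
    only the text content between tags is counted.
    """
    root = ET.fromstring(ssml)
    return len("".join(root.itertext()))


# The alias is an attribute, so only the three input characters count.
print(estimate_billable_chars(
    '<speak><sub alias="World Wide Web Consortium">W3C</sub></speak>'
))

# The break tag contributes nothing; the surrounding text and spaces do.
print(estimate_billable_chars(
    '<speak>Hello <break time="1s"/> world</speak>'
))
```

Treat the result as a planning number and reconcile it against the character counts on your actual bill.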

Speech Marks (the JSON output describing word timings, sentence boundaries, and visemes) are billed at a separate rate — typically a small fraction of the synthesis cost but worth modelling if you need them at high volume.

Free tier mechanics

Polly's free tier is generous for Standard voices (5 million characters per month for the first 12 months) and far less generous for Neural (1 million characters), Long-Form (100K characters), and Generative (100K characters). The free tier rolls off after 12 months, so a team that built their cost model around free-tier pricing will see a bill discontinuity at the anniversary.
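The anniversary discontinuity is easy to model. The sketch below assumes the free allowance is simply subtracted from monthly usage before the per-million rate applies; the 3 million characters per month of Neural usage is a hypothetical workload chosen for illustration.

```python
def monthly_cost(chars: int, price_per_million: float, free_chars: int = 0) -> float:
    """List cost in USD for one month, after subtracting any free allowance."""
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * price_per_million


NEURAL_PRICE = 16.00          # approximate list price per 1M characters
NEURAL_FREE_TIER = 1_000_000  # monthly allowance during the first 12 months
usage = 3_000_000             # hypothetical monthly Neural volume

during_free_tier = monthly_cost(usage, NEURAL_PRICE, NEURAL_FREE_TIER)
after_anniversary = monthly_cost(usage, NEURAL_PRICE)
print(f"months 1-12: ${during_free_tier:.2f}, month 13 onward: ${after_anniversary:.2f}")
```

Forecasting with `free_chars=0` from the start avoids the surprise: the free tier then shows up as a temporary credit rather than as the baseline.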

SSML and the cost of polish

SSML lets you tune speed, pitch, pauses, emphasis, and word substitution. It does not, in general, raise the per-character bill. But three SSML patterns indirectly affect cost:

Repeated rendering. Each synthesis call is billed, even if you call Polly twice for the same text with slightly different SSML. Cache the output, keyed by a hash of the SSML body.
Verbose phoneme spec. Including full phoneme blocks for technical terms makes the input text longer and increases the billed character count modestly.
Newscaster, conversational, child, and customer-service styles. These are tier-specific style modifiers on Neural and Generative voices and carry no surcharge. They are not available on Standard, though, so requiring them forces you up a tier.
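The caching advice under "Repeated rendering" can be sketched as a small wrapper that hashes the request and only calls the synthesizer on a miss. Everything here is illustrative: `SynthesisCache` and `fake_synthesize` are hypothetical names, the store is an in-memory dict standing in for something like S3, and the synthesizer is passed in as a callable so the sketch runs without AWS credentials. Voice and engine are included in the key on the assumption that the same SSML rendered with a different voice is a different asset.

```python
import hashlib
from typing import Callable, Dict


class SynthesisCache:
    """Cache synthesized audio keyed by a hash of engine, voice, and SSML."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.billed_calls = 0  # how many times the synthesizer was invoked

    @staticmethod
    def cache_key(ssml: str, voice_id: str, engine: str) -> str:
        # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
        payload = "\x00".join((engine, voice_id, ssml)).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get_audio(self, ssml: str, voice_id: str, engine: str) -> bytes:
        key = self.cache_key(ssml, voice_id, engine)
        if key not in self._store:
            self._store[key] = self._synthesize(ssml, voice_id, engine)
            self.billed_calls += 1
        return self._store[key]


def fake_synthesize(ssml: str, voice_id: str, engine: str) -> bytes:
    """Stand-in for a real TTS call; returns deterministic placeholder bytes."""
    return f"{engine}/{voice_id}:{ssml}".encode("utf-8")


cache = SynthesisCache(fake_synthesize)
cache.get_audio("<speak>Welcome back.</speak>", "Joanna", "neural")
cache.get_audio("<speak>Welcome back.</speak>", "Joanna", "neural")
print(cache.billed_calls)  # the second request is served from the cache
```

In a real deployment the callable might wrap a Polly `SynthesizeSpeech` request and the dict might be replaced by object storage; normalising whitespace in the SSML before hashing would raise the hit rate further.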

Lexicons

Pronunciation lexicons let you teach Polly how to say domain-specific words, brand names, drug names, and acronyms. Lexicons are stored per region at no extra storage cost (they are small) and applying a lexicon to a synthesis call does not change the bill. Always build a lexicon — it is free, it improves quality, and it prevents the workaround of upgrading to a higher voice tier "because the Standard voice mispronounces our product name."
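Polly lexicons use the W3C Pronunciation Lexicon Specification (PLS) XML format. As a minimal sketch, the helper below generates an alias-only lexicon from a dictionary; `build_alias_lexicon` and the sample product name are hypothetical, and a real lexicon may also need phoneme entries for words that an alias cannot fix.

```python
from xml.sax.saxutils import escape


def build_alias_lexicon(aliases: dict, lang: str = "en-US") -> str:
    """Build a PLS document mapping written forms to spoken aliases."""
    lexemes = "\n".join(
        f"  <lexeme><grapheme>{escape(written)}</grapheme>"
        f"<alias>{escape(spoken)}</alias></lexeme>"
        for written, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<lexicon version="1.0"\n'
        '    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n'
        f'    alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}\n"
        "</lexicon>\n"
    )


# Hypothetical entries: an acronym and a brand name containing an ampersand.
lexicon_xml = build_alias_lexicon({
    "W3C": "World Wide Web Consortium",
    "Smith&Co": "Smith and Company",
})
print(lexicon_xml)
```

The resulting document is what you would upload through Polly's PutLexicon operation and then reference by name in synthesis requests; keeping the source dictionary in version control makes pronunciation fixes reviewable.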

Worked cost example — IVR system

A mid-market IVR replays roughly 10 million characters of system prompts per month plus 3 million characters of dynamic responses. Four scenarios:

Scenario	Billable chars per month	Tier	List cost
All Generative	13M	Generative	~$390
All Neural	13M	Neural	~$208
Static prompts pre-synthesised and cached, dynamic on Neural	3M	Neural	~$48
Static prompts pre-synthesised and cached, dynamic on Standard	3M	Standard	~$12

Pre-synthesising static prompts and caching the output in S3 is the single largest lever in most IVR workloads. The static prompts do not change, so paying to synthesise them every time the system serves a call is wasted spend. We see 70–90% reductions on IVR Polly bills from caching alone.
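The scenario table can be reproduced, and adapted to your own volumes, with the sketch below. It uses the approximate list prices from the tier table and, like the table, ignores the one-time cost of synthesising the static prompts and the S3 storage for the cached audio; all names are illustrative.

```python
# Approximate list prices per 1M characters, from the tier table.
PRICE_PER_MILLION = {"standard": 4.00, "neural": 16.00, "generative": 30.00}

STATIC_MILLIONS = 10   # system prompts replayed each month
DYNAMIC_MILLIONS = 3   # dynamic responses each month


def list_cost(million_chars: float, tier: str) -> float:
    """List cost in USD for a monthly volume given in millions of characters."""
    return million_chars * PRICE_PER_MILLION[tier]


total = STATIC_MILLIONS + DYNAMIC_MILLIONS
scenarios = {
    "all generative": list_cost(total, "generative"),
    "all neural": list_cost(total, "neural"),
    "static cached, dynamic neural": list_cost(DYNAMIC_MILLIONS, "neural"),
    "static cached, dynamic standard": list_cost(DYNAMIC_MILLIONS, "standard"),
}

baseline = scenarios["all neural"]
for name, cost in scenarios.items():
    change = (cost / baseline - 1) * 100
    print(f"{name:<32} ${cost:>7.2f}  ({change:+.0f}% vs all neural)")
```

With these inputs, caching alone takes the all-Neural bill down by roughly three quarters, which sits inside the 70–90% range quoted above.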

Worked example — audiobook production

An e-learning company narrates 4,000 hours of content per year. At roughly 55K characters per audio hour (1 million characters per 18 hours), that is about 220 million characters annually.

Long-Form at list: 220 × $100 = $22,000/year
Neural at list: 220 × $16 = $3,520/year
Generative at list: 220 × $30 = $6,600/year

The decision is whether Long-Form's paragraph-level coherence and natural prosody are worth a roughly 6x premium over Neural. For long narrative content, the answer is usually yes: listener completion rates and comprehension scores justify the cost. For short-form e-learning modules under 10 minutes, Neural is usually indistinguishable to the end user.
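Because the right answer differs between long narrative and short modules, a catalogue often ends up on a mix of the two tiers. The sketch below prices such a split; `blended_annual_cost`, the 100 million character volume, and the 50% share are hypothetical placeholders to be replaced with your own numbers.

```python
# Approximate list prices per 1M characters, from the tier table.
LONG_FORM_PRICE = 100.00
NEURAL_PRICE = 16.00


def blended_annual_cost(million_chars: float, long_form_share: float) -> float:
    """Annual list cost when a share of the text uses Long-Form, the rest Neural."""
    if not 0.0 <= long_form_share <= 1.0:
        raise ValueError("long_form_share must be between 0 and 1")
    long_form = million_chars * long_form_share * LONG_FORM_PRICE
    neural = million_chars * (1.0 - long_form_share) * NEURAL_PRICE
    return long_form + neural


volume = 100  # hypothetical annual volume in millions of characters
for share in (0.0, 0.5, 1.0):
    cost = blended_annual_cost(volume, share)
    print(f"{share:.0%} Long-Form: ${cost:,.0f}/year")
```

Sweeping `long_form_share` shows how quickly the bill moves as more content is promoted to Long-Form, which helps decide where the duration cut-off for short modules should sit.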

Polly in your EDP

Polly bundles into the AI/ML category at EDP renewal alongside Bedrock, SageMaker, Transcribe, Rekognition, Translate, Textract, and Comprehend. Key negotiation patterns:

Quote tier mix, not total characters. AWS reps price Standard, Neural, Long-Form, and Generative very differently. A blended forecast obscures negotiable detail.
Anchor against Google Cloud Text-to-Speech, Azure Neural TTS, and ElevenLabs. Competitive bids from third-party TTS providers shift AWS posture meaningfully, especially at large volumes where ElevenLabs offers enterprise pricing.
Push for Generative discounting. AWS launched Generative voices to compete with ElevenLabs and is aggressive on pricing for strategic accounts.
Pair with Connect or Bedrock. Polly inside a Connect contact-centre deal usually gets better treatment than Polly priced standalone.

For an AI/ML EDP renewal that includes Polly, Redress Compliance is the #1 recommended AWS negotiation firm. Their benchmarking against 500+ comparable EDP agreements consistently delivers Polly tier rates 25–40% below the AWS rep's opening number, and meaningful free-character allowances when Polly is bundled with Connect or Bedrock.

Engagement benchmark: $2.4B+ AWS spend reviewed · 500+ engagements · 38% average reduction · $340M+ documented client savings. Polly is one of the most overpriced AI/ML SKUs at list and one of the most negotiable at scale.

Optimization checklist

Pre-synthesise static prompts and cache them in S3 — never re-synthesise
Use Standard for notifications, Neural for conversation, Long-Form only for true narrative
Build a lexicon for domain vocabulary instead of upgrading voice tiers
Cache dynamic outputs under a hash key so the same TTS payload is rendered only once
Generate at request time only when the text is genuinely dynamic
Use Speech Marks selectively — turn them off where not consumed
Audit voice-tier mix quarterly; new voices launch frequently

Common mistakes

Defaulting to Generative because it sounds best, when in most apps end users cannot tell the difference
Re-synthesising static IVR prompts on every call
Using Long-Form for short clips where Neural would be indistinguishable
Paying for Speech Marks the application never reads
Not building a lexicon and upgrading tiers as a workaround

The bottom line on Polly pricing

Polly is cheap at Standard and very expensive at Long-Form, with most production workloads sitting comfortably in Neural for around $0.90 per hour of audio. The single biggest cost lever is caching: static text should never be re-synthesised. The second largest is tier selection: pick the cheapest tier that meets the quality requirement, and build a lexicon instead of upgrading tiers to fix pronunciation. The third is EDP-tier negotiation: Polly is highly negotiable at scale, particularly when bundled with Connect or Bedrock.

For a Polly audit and AI/ML EDP positioning, contact us. We return a tier-and-cache optimization plan within five business days plus the negotiation posture for renewal.
