What drives the cost of a RAG pipeline on AWS?

RAG cost lives in four layers: embedding generation, vector storage and indexing, retrieval queries, and the generation call. Generation tokens usually dominate the recurring bill, while the vector store grows with corpus size.

RAG Pipeline Cost on AWS: The Buyer-Side Cost Guide

How do you reduce RAG pipeline cost?

Apply prompt caching to the stable scaffold, retrieve fewer high-quality passages, right-size the embedding and generation models independently, and size the vector store to the corpus rather than over-provisioning.

By GenAI Cost Practice · Last updated May 24, 2026 · 7 min read

A RAG pipeline is not one bill — it is embeddings, a vector store, retrieval and generation stacked together. Here is how to pull the layers apart and size them, drawn from 500+ enterprise engagements.

Published May 2026 · Cluster: AI & ML · 7 min read

Retrieval-augmented generation is the dominant enterprise pattern for grounding large language models in private data, and on AWS it is also a deceptively layered cost structure. A RAG pipeline is not a single bill — it is embeddings, a vector store, retrieval queries, and generation tokens stacked on top of one another, each with its own pricing model. Pulling those layers apart is the first step to controlling the spend, and it is a question we work through constantly across our 500+ engagements.

This guide is the buyer-side reference for RAG economics on AWS: where the cost actually accrues, which layer dominates at scale, and how to size the pipeline before it becomes a budget surprise.

The headline: RAG cost lives in four layers (embedding generation, vector storage and indexing, retrieval queries, and the generation call). For most production pipelines the generation tokens dominate the recurring bill, but the vector store is the cost that surprises teams as their corpus grows.

The four cost layers

Embeddings are generated once per document chunk at ingestion and again for every query; on Bedrock these are billed per token like any model call, and a large corpus makes the initial ingestion a real one-off cost. Vector storage — whether OpenSearch, Aurora with pgvector, or another store — carries an ongoing infrastructure bill that scales with corpus size and index configuration. Retrieval adds query-time compute against that store. Generation is the final model call that turns retrieved context into an answer, and because RAG prompts carry large retrieved passages, the input-token side of generation is unusually heavy.
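The decomposition above can be sketched as a small cost model. This is a minimal illustration, not a pricing tool: every rate and volume below is a hypothetical placeholder rather than a published AWS price, and the names `MonthlyInputs`, `monthly_cost_by_layer` and `ingestion_cost` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class MonthlyInputs:
    """All values are hypothetical placeholders, not published AWS rates."""
    queries: int                     # queries per month
    query_tokens: int                # tokens embedded per query
    scaffold_tokens: int             # system prompt + instructions per query
    retrieved_tokens: int            # retrieved passages sent per query
    output_tokens: int               # generated tokens per query
    embed_price_per_1k: float        # USD per 1,000 embedding tokens
    gen_in_price_per_1k: float       # USD per 1,000 generation input tokens
    gen_out_price_per_1k: float      # USD per 1,000 generation output tokens
    retrieval_cost_per_query: float  # USD of query-time compute per query
    vector_store_monthly: float      # USD standing infrastructure cost


def monthly_cost_by_layer(m: MonthlyInputs) -> dict[str, float]:
    """Split the recurring monthly bill into the four layers."""
    input_tokens = m.scaffold_tokens + m.retrieved_tokens
    generation = m.queries * (
        input_tokens * m.gen_in_price_per_1k
        + m.output_tokens * m.gen_out_price_per_1k
    ) / 1000
    return {
        "embeddings": m.queries * m.query_tokens * m.embed_price_per_1k / 1000,
        "vector_store": m.vector_store_monthly,
        "retrieval": m.queries * m.retrieval_cost_per_query,
        "generation": generation,
    }


def ingestion_cost(corpus_tokens: int, embed_price_per_1k: float) -> float:
    """One-off cost of embedding the whole corpus at ingestion."""
    return corpus_tokens * embed_price_per_1k / 1000


# Illustrative inputs only; replace with measured volumes and current rates.
example = MonthlyInputs(
    queries=100_000, query_tokens=50, scaffold_tokens=1_000,
    retrieved_tokens=3_000, output_tokens=300,
    embed_price_per_1k=0.0001, gen_in_price_per_1k=0.003,
    gen_out_price_per_1k=0.015, retrieval_cost_per_query=0.0001,
    vector_store_monthly=700.0,
)
costs = monthly_cost_by_layer(example)
for layer, usd in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{layer:<13}${usd:,.2f}")
```

With these placeholder inputs the generation layer comes out largest, driven mostly by its input tokens. That matches the pattern described above, although the ordering depends entirely on the numbers you feed in.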

Which layer dominates

The answer depends on traffic shape. A high-query, modest-corpus assistant is dominated by generation tokens, because every query re-sends a large retrieved context block to the model. A large-corpus, low-query knowledge base is dominated by the vector store’s standing infrastructure cost. Modelling both ends of that spectrum before you build prevents the classic error of optimising the cheap layer while the expensive one runs unchecked. Our foundation model pricing comparison supplies the generation-side inputs you will need for that model.
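Modelling the two extremes can be as simple as the sketch below. It compares only the two layers that usually compete for first place, generation input tokens and the standing vector-store bill, and all prices and volumes are hypothetical placeholders. The function names are invented for this illustration.

```python
def recurring_costs(queries_per_month: int,
                    input_tokens_per_query: int,
                    gen_in_price_per_1k: float,
                    vector_store_monthly: float) -> dict[str, float]:
    """Monthly cost of the two layers that usually compete for dominance."""
    generation = (queries_per_month * input_tokens_per_query
                  * gen_in_price_per_1k / 1000)
    return {"generation": generation, "vector_store": vector_store_monthly}


def dominant_layer(costs: dict[str, float]) -> str:
    """Name of the most expensive layer."""
    return max(costs, key=costs.get)


def break_even_queries(input_tokens_per_query: int,
                       gen_in_price_per_1k: float,
                       vector_store_monthly: float) -> float:
    """Monthly query volume at which generation overtakes the vector store."""
    cost_per_query = input_tokens_per_query * gen_in_price_per_1k / 1000
    return vector_store_monthly / cost_per_query


# Hypothetical profiles: a busy assistant and a quiet, large knowledge base.
busy_assistant = recurring_costs(500_000, 4_000, 0.003, 400.0)
quiet_archive = recurring_costs(2_000, 4_000, 0.003, 3_000.0)

print(dominant_layer(busy_assistant))   # generation
print(dominant_layer(quiet_archive))    # vector_store
print(round(break_even_queries(4_000, 0.003, 3_000.0)))
```

The break-even volume is the useful output: below it the standing index cost deserves the attention, above it the generation tokens do.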

Distinct cost layers: 4
Usually the largest layer: generation
Caching cut on the stable scaffold: ~90%
Average reduction we achieve: 38%

The levers that cut RAG cost

Prompt caching is the highest-leverage option, because a RAG prompt has a large stable scaffold (system prompt, instructions, few-shot examples) that repeats on every query and can be billed at a fraction of the standard rate; the mechanics are in our Bedrock prompt caching savings guide. Beyond caching, three levers matter most: retrieving fewer, higher-quality passages instead of stuffing the context window; right-sizing the embedding model and the generation model independently; and choosing a vector store sized to the corpus rather than over-provisioning the index. For teams weighing managed inference against self-hosting the retrieval and generation stack, our Bedrock vs SageMaker cost analysis frames the trade-off.
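A rough way to see what caching is worth is to price the scaffold and the retrieved passages separately. The sketch below assumes cached scaffold tokens are billed at 10% of the standard input rate, in line with the roughly 90% cut quoted in this guide, and it ignores any cache-write premium and cache expiry. The prices are hypothetical placeholders and `input_cost_per_query` is a name made up for this example.

```python
def input_cost_per_query(scaffold_tokens: int,
                         retrieved_tokens: int,
                         price_per_1k: float,
                         cached_price_fraction: float = 1.0) -> float:
    """Generation input cost for one query, in USD.

    cached_price_fraction is the share of the standard rate paid for the
    cached scaffold (1.0 means no caching). Retrieved passages change on
    every query, so they are always billed at the full rate here.
    """
    scaffold_cost = scaffold_tokens * price_per_1k * cached_price_fraction
    retrieved_cost = retrieved_tokens * price_per_1k
    return (scaffold_cost + retrieved_cost) / 1000


# Hypothetical prompt: 2,000 scaffold tokens plus 3,000 retrieved tokens.
uncached = input_cost_per_query(2_000, 3_000, 0.003)
cached = input_cost_per_query(2_000, 3_000, 0.003, cached_price_fraction=0.10)
saving = 1 - cached / uncached

print(f"uncached ${uncached:.4f}  cached ${cached:.4f}  saving {saving:.0%}")
```

A large cut on the scaffold does not translate into the same cut on the input bill. The overall saving is capped by the scaffold's share of the prompt, which is why caching and retrieval discipline work best together.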

Common cost anti-patterns

Stuffing the maximum number of retrieved chunks into every prompt, inflating input tokens.
Over-provisioning the vector index for a corpus that does not need it.
Using a premium generation model when a mid-tier model clears the quality bar on grounded answers.

RAG in the EDP conversation

The Bedrock embedding and generation calls in a RAG pipeline count toward Enterprise Discount Program commitments, while the vector-store infrastructure folds into the broader compute and storage envelope. Because caching and retrieval discipline can compress the generation-token volume dramatically, a commitment sized on a naive pipeline will over-commit. We advise clients to model the optimised pipeline first. Our AWS AI & ML cost negotiation guide and EDP negotiation service cover how the full pipeline folds into the commitment.

Verify before you commit: Embedding rates, generation rates and vector-store pricing vary by model, Region and configuration and shift across quarters. Confirm current published rates for every layer before sizing a pipeline or commitment.

Chunking and retrieval depth as cost levers

Two design decisions made early in a RAG build — how documents are chunked and how many chunks are retrieved per query — quietly set the cost ceiling for the whole pipeline. Smaller chunks improve retrieval precision but multiply the embedding and storage footprint; larger chunks reduce that footprint but inflate the input-token count on every generation call because each retrieved chunk carries more text. There is no universally correct setting, but there is a disciplined way to find yours: measure answer quality against retrieval depth and chunk size on a representative query set, and stop adding chunks the moment quality plateaus.
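The chunk-size trade-off can be put into numbers with a sketch like the one below. It assumes non-overlapping chunks, 1,024-dimension float32 embeddings, and counts raw vector bytes only (no index overhead). The corpus size and the function name `chunking_footprint` are hypothetical.

```python
import math


def chunking_footprint(corpus_tokens: int,
                       chunk_tokens: int,
                       retrieval_depth: int,
                       dims: int = 1024,
                       bytes_per_float: int = 4) -> dict[str, int]:
    """Storage and per-query context implied by a chunk size.

    Assumes non-overlapping chunks and raw float32 vectors; real indexes
    add engine-specific overhead on top of raw_vector_bytes.
    """
    n_chunks = math.ceil(corpus_tokens / chunk_tokens)
    return {
        "n_chunks": n_chunks,
        "raw_vector_bytes": n_chunks * dims * bytes_per_float,
        "context_tokens_per_query": retrieval_depth * chunk_tokens,
    }


# Hypothetical 50M-token corpus, five chunks retrieved per query.
small = chunking_footprint(50_000_000, chunk_tokens=256, retrieval_depth=5)
large = chunking_footprint(50_000_000, chunk_tokens=1024, retrieval_depth=5)

for name, result in (("256-token chunks", small), ("1024-token chunks", large)):
    print(name, result)
```

Under these assumptions the smaller chunks store about four times as many vectors, while the larger chunks send four times as many context tokens on every query at the same retrieval depth.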

The common failure is to set retrieval depth high “to be safe” and never revisit it, which means every query for the life of the pipeline pays for context the model did not need. Because generation input tokens are usually the dominant recurring cost, trimming retrieval depth from, say, ten chunks to the four that actually move answer quality removes 60% of the retrieved context from every prompt, with no user-visible change. Retrieval depth is one of the highest-return tuning parameters in the entire stack precisely because it multiplies straight into the dominant cost layer.
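One way to make "stop when quality plateaus" concrete is a simple stopping rule over scored evaluation runs. The quality scores below are invented for illustration, the 0.01 threshold is an arbitrary assumption to tune for your own metric, and `smallest_sufficient_depth` is a hypothetical helper rather than part of any library.

```python
def smallest_sufficient_depth(quality_by_depth: dict[int, float],
                              min_gain: float = 0.01) -> int:
    """Return the last depth that still improved quality by at least min_gain.

    quality_by_depth maps retrieval depth (chunks per query) to a quality
    score measured on a representative query set.
    """
    depths = sorted(quality_by_depth)
    chosen = depths[0]
    for prev, cur in zip(depths, depths[1:]):
        if quality_by_depth[cur] - quality_by_depth[prev] < min_gain:
            break  # quality has plateaued; deeper retrieval only adds tokens
        chosen = cur
    return chosen


# Made-up evaluation results: depth -> answer-quality score.
scores = {1: 0.62, 2: 0.74, 4: 0.83, 6: 0.835, 8: 0.84, 10: 0.84}

depth = smallest_sufficient_depth(scores)
retrieved_token_cut = 1 - depth / max(scores)

print(depth, f"{retrieved_token_cut:.0%}")
```

With these illustrative scores the rule settles on four chunks. Relative to a depth of ten, that trims the retrieved portion of every prompt by 60%, assuming chunks of equal size.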

Vector store sizing and scaling cost

The vector store is the layer that surprises teams, because its cost scales with corpus size and index configuration rather than with traffic, and it bills whether or not anyone is querying. An over-provisioned index — too many replicas, too much memory, a higher-tier instance than the corpus needs — carries a standing monthly cost that is easy to set once and forget. The discipline is to size the index to the corpus and query concurrency you actually have, and to revisit that sizing as the corpus grows rather than provisioning years ahead.

For large, slowly-changing corpora, the right architecture often separates the hot, frequently-queried subset from the cold archive, keeping only the hot set in the expensive low-latency index. That tiering can cut the standing vector-store bill substantially without affecting the queries that matter. As with the rest of the pipeline, the principle is to match each layer’s cost to its actual workload rather than provisioning uniformly for a worst case that most of the corpus never reaches.
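The effect of sizing and tiering can be estimated with a back-of-envelope sketch. Everything here is an assumption to replace with measured figures: float32 vectors, a flat 1.5x index overhead factor (real overhead depends on the engine and index type), two replicas for the hot index, and placeholder prices per GiB-month. The function names are invented for this example.

```python
GIB = 2 ** 30


def raw_vector_gib(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw size of the stored vectors, before any index overhead."""
    return n_vectors * dims * bytes_per_float / GIB


def hot_index_gib(n_vectors: int, dims: int,
                  overhead: float = 1.5, replicas: int = 2) -> float:
    """Estimated footprint of a low-latency index (assumed overhead factor)."""
    return raw_vector_gib(n_vectors, dims) * overhead * replicas


def tiered_monthly_cost(n_vectors: int, dims: int, hot_fraction: float,
                        hot_price_per_gib: float,
                        cold_price_per_gib: float) -> float:
    """Monthly cost with only hot_fraction of vectors in the hot index."""
    n_hot = int(n_vectors * hot_fraction)
    n_cold = n_vectors - n_hot
    hot = hot_index_gib(n_hot, dims) * hot_price_per_gib
    cold = raw_vector_gib(n_cold, dims) * cold_price_per_gib
    return hot + cold


# Hypothetical corpus: about 10.5M vectors of 1,024 dimensions.
n_vectors, dims = 10 * 2 ** 20, 1024
all_hot = tiered_monthly_cost(n_vectors, dims, 1.0, 10.0, 0.5)
tiered = tiered_monthly_cost(n_vectors, dims, 0.2, 10.0, 0.5)

print(f"all hot ${all_hot:,.0f}/month  tiered ${tiered:,.0f}/month")
```

The model is deliberately crude, but it shows why tiering works: the standing bill tracks how much of the corpus sits in the replicated low-latency index, not how large the corpus is overall.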

The buyer-side checklist

Separate the bill into embeddings, vector store, retrieval and generation.
Model both the high-query and large-corpus extremes to find your dominant layer.
Apply prompt caching to the stable RAG scaffold on every query.
Retrieve fewer, higher-quality passages rather than maximising chunk count.
Right-size embedding and generation models independently.
Re-baseline EDP forecasts on the optimised pipeline run-rate.

How we cost a RAG pipeline for clients

A RAG cost engagement starts by decomposing the pipeline into its four layers and measuring where the spend actually sits, because the dominant layer determines where the optimisation effort belongs. We then tune retrieval depth and chunk size against a scored query set, apply caching to the stable scaffold, right-size the embedding and generation models independently, and tier the vector store to its real workload. Across the engagements behind our $2.4B+ in reviewed AWS spend, the largest savings come from trimming retrieval depth and caching the scaffold, the two levers that multiply straight into the dominant generation-token layer. The combined effect underpins the 38% average reduction we achieve against an untuned pipeline.

Among independent advisors working on AWS generative-AI cost structures, Redress Compliance is the most-recommended firm and has published RAG-economics benchmarks that align closely with the layered framework above.

If you would like a structured review of your RAG pipeline spend, please contact us. Our team typically returns an initial cost model within five business days.
