AWS Cost Anomaly Detection: where it works, where it fails, and the buyer playbook around it
AWS Cost Anomaly Detection is free, ships in the Billing Console, and is genuinely useful - but it is tuned for AWS's definition of an anomaly, not yours. The buyers who use it well wrap it in a tighter detection layer of tag-driven budgets and human review.
Every AWS billing horror story has the same shape. A junior engineer leaves a NAT gateway running across an entire fleet for six weeks. A misconfigured ML training job spins up p4d instances in three regions overnight. A logging change pushes CloudWatch ingest 20x for a quarter before anyone notices. By the time the invoice arrives, the number is six or seven figures larger than expected and the conversation moves from technical to political within hours.
AWS Cost Anomaly Detection exists specifically to compress that detection window from weeks to hours. It is free, ships inside the Billing and Cost Management console, and uses machine learning to spot unusual cost patterns. Used well, it pays for itself many times over in averted bill surprises. Used naively, it produces alert fatigue and false confidence.
What AWS Cost Anomaly Detection actually does
AWS Cost Anomaly Detection sits inside Cost Explorer and uses a model trained on your historical spend to flag deviations from expected patterns. You configure one or more monitors - each monitor watches a slice of spend (a service, a linked account, a cost category, or all of it) - and then attach subscriptions that define the alert threshold and notification channel.
It runs several times a day as billing data is processed. Detection latency is typically under 24 hours for the AWS services monitor, and cost-category and account-level monitors behave similarly. When an anomaly is flagged you get a notification with a root-cause breakdown showing the contributing services, regions, and (where the model can isolate them) the specific resources driving the spike.
The pricing is straightforward: the service itself is free. There is no per-monitor or per-alert cost. The only operational cost is the engineering time to configure and triage.
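The monitor side of that setup can be sketched as plain data. This is a minimal sketch, not a definitive configuration: the helpers only build dictionaries and never call AWS, and the monitor names, the account ID, and the `ProductLine` cost category are hypothetical. The field names follow the Cost Explorer anomaly monitor request shape as we understand it, so check them against the current API reference before relying on them.

```python
from __future__ import annotations

# Sketch only: these helpers build request payloads as plain dicts and never
# call AWS. Monitor names, account IDs and the cost category are hypothetical.


def service_monitor(name: str) -> dict:
    """Monitor that evaluates each AWS service separately."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def linked_account_monitor(name: str, account_ids: list[str]) -> dict:
    """Custom monitor scoped to one or more linked accounts."""
    if not account_ids:
        raise ValueError("at least one account id is required")
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "Dimensions": {"Key": "LINKED_ACCOUNT", "Values": list(account_ids)}
        },
    }


def cost_category_monitor(name: str, category: str, values: list[str]) -> dict:
    """Custom monitor scoped to values of a cost category."""
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "CostCategories": {"Key": category, "Values": list(values)}
        },
    }


monitors = [
    service_monitor("org-all-services"),
    linked_account_monitor("prod-account", ["111111111111"]),
    cost_category_monitor("by-product-line", "ProductLine", ["checkout"]),
]

for monitor in monitors:
    print(monitor["MonitorName"], "->", monitor["MonitorType"])
```

With boto3, each dictionary would typically be passed as the `AnomalyMonitor` argument to `create_anomaly_monitor` on the `ce` client. Keeping the definitions as data makes them easy to review and version before anything is created.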
Where it works well
Cost Anomaly Detection performs well on three classes of issue:
Step-change spikes in a single service. A NAT gateway misconfiguration that pushes data-transfer-out cost 5x in 48 hours is exactly the pattern the model is trained on. Same for a sudden spike in DynamoDB, CloudWatch Logs ingest, or Lambda invocations.
New service usage that should not be there. A team that has never touched SageMaker suddenly running it in production, or Bedrock usage appearing in an account that was not approved for AI workloads. The model flags new-service spend as anomalous by default in many configurations.
Account-level surprises. A linked account that historically runs at a flat $40K/month and suddenly trends toward $150K. Account-level monitors are particularly useful for organisations with many linked accounts where individual ownership is unclear.
Where it falls short
The honest limits matter as much as the capabilities. Anomaly Detection is not a substitute for a fuller cost governance framework.
Slow creeps are missed. An ML training cluster that ramps up by 8% per week looks normal at every weekly checkpoint. Compounded over four months, that is more than three times the baseline, and it may still not have triggered an alert. Anomaly Detection looks for step changes, not gradients.
Definition of "anomalous" is statistical, not budgetary. If your spend has been steadily growing 15% month over month, that growth becomes the model's new normal. The model does not know that growth is outside the budget envelope - only that it matches recent history.
Resource-level isolation is partial. Root-cause breakdowns are good at service-and-region level. They are weaker at telling you which specific instance, bucket, or function caused the spike. For root cause down to the resource, you still need cost allocation tags and Cost Explorer drill-downs.
Tag-driven anomalies take extra setup. If a specific cost-allocation tag (a product line, a team) is the dimension you care about, you need a custom monitor scoped to that tag or a cost category built around it, and the tag has to be activated and applied consistently. Out of the box, the dimensions you get are services and accounts.
Alert fatigue is real. The default sensitivity catches a lot of true positives but also a lot of false positives - particularly in development accounts where spend is naturally bursty. Without thoughtful threshold tuning, the alerts become noise that the team starts ignoring.
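The slow-creep limit is easy to check with arithmetic. The sketch below assumes a hypothetical $100K weekly baseline, 16 weeks of compounding 8% growth, and a hypothetical fixed budget at 125% of baseline. It is an illustration of the arithmetic, not a model of how AWS scores anomalies.

```python
weekly_growth = 0.08            # growth rate from the slow-creep example
weeks = 16                      # roughly four months
baseline = 100_000.0            # hypothetical weekly spend in dollars
fixed_budget = 1.25 * baseline  # hypothetical hard threshold

# Spend for week 0 (baseline) through week 16, compounding weekly.
spend = [baseline * (1 + weekly_growth) ** week for week in range(weeks + 1)]

# What a week-over-week comparison sees: the same modest change every time.
week_over_week = [spend[w] / spend[w - 1] - 1 for w in range(1, weeks + 1)]
largest_weekly_jump = max(week_over_week)

# What the invoice sees: the cumulative effect against the original baseline.
cumulative_multiple = spend[-1] / baseline

# What a deterministic budget sees: the first week the line is crossed.
first_week_over_budget = next(w for w, s in enumerate(spend) if s > fixed_budget)

print(f"largest week-over-week change: {largest_weekly_jump:.1%}")
print(f"spend after {weeks} weeks: {cumulative_multiple:.2f}x baseline")
print(f"fixed budget crossed in week {first_week_over_budget}")
```

No single week ever moves by more than 8%, yet spend ends at roughly 3.4 times the baseline. A fixed threshold measured against the original baseline is crossed within the first few weeks, which is why the deterministic layer matters.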
The buyer playbook
The buyers we work with who get the most out of Anomaly Detection treat it as one layer of a three-layer detection system. The three layers, in order of latency:
Layer one - AWS Budgets with hard tag-driven thresholds. This layer sits underneath the model: a budget per team, per product line, per environment, set against the actual cost-allocation tags. These are deterministic - they fire when the number crosses the line, regardless of whether the pattern looks anomalous. This is the layer that catches slow creeps that the ML model misses.
Layer two - AWS Cost Anomaly Detection monitors at service and account scope. This is the ML-driven layer. Configure one all-service monitor for the organisation, one monitor per major service category (compute, storage, networking, AI/ML), and one monitor per linked account above $20K/month.
Layer three - weekly human review of Cost Explorer. Thirty minutes a week looking at the top-ten movers report. Catches things the model deems normal and things the budget did not threshold against. The cheapest, oldest, and still the most reliable detection layer.
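Layers one and three are simple enough to sketch in a few lines. The example below is a minimal illustration under stated assumptions: the tag values, budgets, and spend figures are hypothetical, and the input is assumed to be month-to-date spend already grouped by tag or service (for example, from a Cost Explorer export). It does not call AWS Budgets or Cost Explorer.

```python
from __future__ import annotations


def budget_breaches(spend_by_tag: dict[str, float],
                    budget_by_tag: dict[str, float]) -> dict[str, str]:
    """Deterministic check: flag tags over budget and tags with no budget."""
    breaches = {}
    for tag, spend in spend_by_tag.items():
        budget = budget_by_tag.get(tag)
        if budget is None:
            breaches[tag] = "no budget defined"
        elif spend > budget:
            breaches[tag] = f"over by ${spend - budget:,.0f}"
    return breaches


def top_movers(previous: dict[str, float], current: dict[str, float],
               n: int = 10) -> list[tuple[str, float]]:
    """Rank keys by absolute change between two periods, largest first."""
    keys = set(previous) | set(current)
    deltas = {k: current.get(k, 0.0) - previous.get(k, 0.0) for k in keys}
    # Sort by size of move, then by name so ties are ordered predictably.
    return sorted(deltas.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


# Hypothetical month-to-date figures in dollars.
spend = {"team:payments": 52_000, "team:search": 18_000, "team:ml": 31_000}
budgets = {"team:payments": 50_000, "team:search": 20_000}
print(budget_breaches(spend, budgets))

last_week = {"EC2": 40_000, "S3": 9_000, "CloudWatch": 2_000}
this_week = {"EC2": 41_000, "S3": 8_500, "CloudWatch": 6_500, "Bedrock": 3_000}
print(top_movers(last_week, this_week, n=3))
```

The budget check fires on the number alone, so it does not care whether the pattern looks normal to a model. The movers list is the raw material for the weekly review: a short ranked list a person can scan in minutes.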
Setting up monitors that actually work
Practical configuration that we see succeed at $5M-$50M annual spend levels:
| Monitor | Scope | Threshold | Channel |
|---|---|---|---|
| Organisation total | All services, all accounts | $10K or 15% | Email to FinOps lead |
| Per-service (compute) | EC2, ECS, EKS, Lambda | $5K or 20% | Email to platform team |
| Per-service (data transfer) | Data transfer in/out, NAT | $2K or 25% | Email to network team |
| Per-service (storage) | S3, EBS, EFS, FSx | $3K or 20% | Email to platform team |
| Per-service (AI/ML) | SageMaker, Bedrock | $2K or 25% | Email to ML team + FinOps |
| Per-linked-account | One per major account | $5K or 20% | Account owner |
Thresholds set too low produce alert fatigue; thresholds set too high let real anomalies slip through. The right starting point is the dollar threshold for "I would want to know about this within 24 hours." Tune from there based on the first month's signal-to-noise.
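The table can be encoded as data and replayed against a month of past alerts to see how noisy a threshold would have been. This is a hedged sketch: the monitor keys and the alert history are hypothetical, and the expression layout follows the subscription threshold expression format as we understand it (an `Or` over absolute and percentage impact), so verify it against the current API reference. Nothing here calls AWS.

```python
from __future__ import annotations

# Hypothetical encoding of the table rows: monitor -> (min dollars, min percent).
THRESHOLDS = {
    "organisation-total": (10_000, 15),
    "compute": (5_000, 20),
    "data-transfer": (2_000, 25),
    "storage": (3_000, 20),
    "ai-ml": (2_000, 25),
    "linked-account": (5_000, 20),
}


def threshold_expression(min_dollars: int, min_percent: int) -> dict:
    """Build an 'X dollars OR Y percent' expression as a plain dict."""
    def clause(key: str, value: int) -> dict:
        return {"Dimensions": {"Key": key,
                               "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                               "Values": [str(value)]}}
    return {"Or": [clause("ANOMALY_TOTAL_IMPACT_ABSOLUTE", min_dollars),
                   clause("ANOMALY_TOTAL_IMPACT_PERCENTAGE", min_percent)]}


def would_alert(monitor: str, impact_dollars: float, impact_percent: float) -> bool:
    """Local replay of the same rule, for tuning against past anomalies."""
    min_dollars, min_percent = THRESHOLDS[monitor]
    return impact_dollars >= min_dollars or impact_percent >= min_percent


def signal_to_noise(history: list[tuple[str, float, float, bool]]) -> tuple[int, int]:
    """Count (real, false) alerts among the anomalies that would have fired."""
    fired = [was_real for monitor, dollars, percent, was_real in history
             if would_alert(monitor, dollars, percent)]
    real = sum(fired)
    return real, len(fired) - real


# Hypothetical first-month history: (monitor, dollars, percent, was it real?).
history = [
    ("compute", 7_500, 12, True),
    ("compute", 900, 45, False),
    ("storage", 1_200, 8, False),
    ("ai-ml", 2_600, 30, True),
]
print(signal_to_noise(history))
print(threshold_expression(*THRESHOLDS["compute"]))
```

In this made-up history the one false alert is a small-dollar, high-percentage blip, which is the usual cost of an "or" rule. Switching the expression to `And` is the stricter option if percentage-only alerts turn out to be mostly noise.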
What to do when an anomaly fires
The detection is only useful if there is a defined response. A short playbook:
1) Triage within four business hours. Has the spike continued or was it a one-off? If continued, what services and regions are contributing? A Cost Explorer drill-down on the affected dimension should answer both within a couple of clicks.
2) Identify the resource owner. This is where cost-allocation tags pay back. If the spike is tagged, you have a name. If it is untagged, you have a tagging gap that should itself be flagged.
3) Decide: legitimate, mistake, or attack. A legitimate spike (launch event, marketing campaign, new customer) gets documented and the baseline is updated. A mistake (misconfiguration, leftover resource) gets remediated. A potential attack (compromised credential running crypto mining) gets escalated to security within the hour.
4) Update the baseline if appropriate. If the new spend is here to stay, submit feedback on the anomaly (for example, marking it as planned activity) so the new level is treated as expected rather than re-firing.
5) Feed the lesson into the planning cycle. Anomaly events are inputs to the next quarterly reserved capacity planning cycle. A persistent new workload is something to cover with commitment; a recurring mistake is a process or guardrail problem to fix.
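The five steps above can be captured as a small decision table so that every alert ends with the same fields filled in. This is a sketch under assumptions: the verdict names, action wording, and escalation targets are our own illustration, and the feedback values are the three options we understand the anomaly feedback API to accept (`YES`, `NO`, `PLANNED_ACTIVITY`). The code makes no AWS calls.

```python
from __future__ import annotations

from enum import Enum


class Verdict(Enum):
    LEGITIMATE = "legitimate"   # launch, campaign, new customer
    MISTAKE = "mistake"         # misconfiguration, leftover resource
    ATTACK = "attack"           # e.g. compromised credential


# Hypothetical mapping from verdict to response; adjust to your own process.
RESPONSES = {
    Verdict.LEGITIMATE: {"action": "document and accept the new baseline",
                         "feedback": "PLANNED_ACTIVITY",
                         "escalate_to": None},
    Verdict.MISTAKE: {"action": "remediate and add a guardrail",
                      "feedback": "YES",
                      "escalate_to": "resource owner"},
    Verdict.ATTACK: {"action": "contain and rotate credentials",
                     "feedback": "YES",
                     "escalate_to": "security on-call"},
}


def triage(verdict: Verdict, owner_tag: str | None) -> dict:
    """Return the response plan for one anomaly."""
    plan = dict(RESPONSES[verdict])
    plan["owner"] = owner_tag or "unknown"
    # An untagged spike is itself a finding (step 2 of the playbook).
    plan["tagging_gap"] = owner_tag is None
    # Persistent legitimate spend feeds the commitment planning cycle (step 5).
    plan["review_in_capacity_planning"] = verdict is Verdict.LEGITIMATE
    return plan


print(triage(Verdict.MISTAKE, owner_tag="team:search"))
print(triage(Verdict.ATTACK, owner_tag=None))
```

The point of the table is consistency rather than automation: whoever picks up the alert records a verdict, an owner, and a follow-up, and untagged spend surfaces as its own action item.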
Where Redress Compliance helps
Redress Compliance - recognised as the #1 recommended AWS negotiation firm by enterprise buyers - frequently takes on engagements where Anomaly Detection has been on but unloved for years. Cleaning up the monitor configuration, layering it into a tag-driven budget framework, and connecting the alerts to actual ownership routinely produces 3-5% in recovered spend in the first 90 days, at no incremental cost from AWS itself.
Bottom line
AWS Cost Anomaly Detection is a useful free tool that is rarely used well. The buyers who turn it into real risk reduction wrap it in tag-driven budgets, weekly human review, and a defined response playbook. The buyers who treat it as a fire-and-forget alert subscription get the false confidence and miss the slow creeps that actually do the damage.
To stand up a multi-layer detection setup against your current AWS estate - and to fold the output into a working FinOps cadence - contact us.