Glue Job Cost Optimization: DPU-Hour Math, Worker Types, and the Spark Tax
AWS Glue bills Spark jobs at $0.44 per DPU-hour. Python shell jobs pay the same rate but run at a fraction of the DPU count. The rate is small. What matters is the number of DPU-hours actually consumed, and most Glue estates are paying for two to three times the compute they need.
AWS Glue is the default ETL service for most modern AWS data platforms. It is also one of the more opaque cost lines on a large analytics bill. The pricing model is DPU-hour, but the actual DPU count depends on worker type, the number of workers chosen, the autoscaling configuration, and how the Spark job is written. A poorly written Glue job that runs on G.2X workers with autoscaling disabled costs three to five times what the same job would cost on G.1X with autoscaling enabled. This piece walks the cost levers in order of payoff.
How Glue billing actually works
- Spark jobs (Glue ETL): $0.44 per DPU-hour, one-minute minimum, billed per second. Default worker is G.1X (1 DPU = 4 vCPU + 16 GB RAM); G.2X is 2 DPU per worker, G.4X is 4 DPU, G.8X is 8 DPU.
- Python shell jobs: the same $0.44 per DPU-hour, but sized at 1/16 DPU or 1 DPU, which makes them much cheaper for lightweight scripts.
- Glue Streaming: billed continuously at DPU-hour while the stream is active.
- Glue Data Catalog: $1 per 100,000 objects stored per month plus $1 per million requests, each charged after a free first million.
- Crawlers: $0.44 per DPU-hour, ten-minute minimum per crawl.
- Glue DataBrew: $0.48 per node-hour separate from Glue ETL.
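The Spark line above reduces to a single formula: DPUs per worker, times workers, times billed hours, times the rate. A minimal sketch of that arithmetic, assuming the $0.44 list rate and the worker-to-DPU mapping quoted above; the function and constant names are illustrative, not part of any AWS SDK.

```python
# Illustrative DPU-hour calculator for Glue Spark jobs.
# Rate and DPU mapping are taken from the list above; names are hypothetical.
SPARK_RATE_PER_DPU_HOUR = 0.44
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}
MIN_BILLED_SECONDS = 60  # one-minute minimum, then per-second billing


def spark_job_cost(worker_type, workers, runtime_seconds,
                   rate=SPARK_RATE_PER_DPU_HOUR):
    """Return the cost in dollars of one job run."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    dpus = DPU_PER_WORKER[worker_type] * workers
    return dpus * billed_seconds / 3600 * rate


# Same worker count and runtime, different worker type.
print(f"G.1X: ${spark_job_cost('G.1X', 10, 1800):.2f}")  # G.1X: $2.20
print(f"G.2X: ${spark_job_cost('G.2X', 10, 1800):.2f}")  # G.2X: $4.40
```

The per-run figure looks trivial, which is the point of the opening paragraph: the rate is fixed, so every saving has to come from the DPU count or the runtime.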
The worker-type decision
| Worker | DPUs | Memory | Best for |
|---|---|---|---|
| G.1X | 1 per worker | 16 GB | Default. Most jobs. |
| G.2X | 2 per worker | 32 GB | Memory-bound transformations, ML preprocessing. |
| G.4X | 4 per worker | 64 GB | Heavy shuffles, large in-memory joins. |
| G.8X | 8 per worker | 128 GB | Massive single-node workloads, ML training. |
The default should always be G.1X. Move up only when monitoring shows memory pressure on Spark executors. Most jobs that are migrated to G.2X for "safety" were not memory-bound; the migration doubles the cost for no performance benefit.
Autoscaling
Glue autoscaling (Glue 3.0 and later) adds and removes workers during a job run to match the parallelism of each stage. Savings on bursty or stage-uneven jobs are typically 30 to 60 percent. Two configurations matter:
- Max workers: set the ceiling. Most jobs do not need more than 20 workers; many do not need more than 10.
- Min workers: Glue manages this automatically. Do not force a high minimum unless the workload is genuinely steady-state.
Autoscaling pays for itself within the first run on any job whose Spark stages are uneven.
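As a rough sketch, the autoscaling settings can be expressed as the fragment of a job definition that would be merged into a boto3 `create_job` or `update_job` call. The field names and the `--enable-auto-scaling` argument reflect the Glue API as I understand it and should be checked against current documentation; the helper name and its default values are hypothetical.

```python
def autoscaling_job_settings(max_workers, worker_type="G.1X",
                             glue_version="4.0"):
    """Build the job-definition fields that switch autoscaling on.

    With autoscaling enabled, NumberOfWorkers acts as the ceiling
    rather than a fixed allocation.
    """
    if float(glue_version) < 3.0:
        raise ValueError("autoscaling needs Glue 3.0 or later")
    return {
        "GlueVersion": glue_version,
        "WorkerType": worker_type,
        "NumberOfWorkers": max_workers,  # the ceiling
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


settings = autoscaling_job_settings(max_workers=10)
print(settings["DefaultArguments"])
```

Keeping the ceiling as an explicit argument makes it easy to review: a job asking for more than 20 workers stands out in a code review in a way it does not in the console.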
Bookmarks and incremental processing
Glue Job Bookmarks track which input files have already been processed. Without bookmarks, every job run processes the full dataset. With bookmarks, jobs process only new data. The cost reduction on incremental pipelines is typically an order of magnitude.
- Enable bookmarks on all S3-source ETL jobs that incrementally load data.
- Pair bookmarks with partitioned source data so Glue does not need to list the entire bucket.
- Test bookmark behaviour after schema changes; a bookmark mismatch can silently reprocess everything.
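Bookmarks are controlled by a job argument plus a few conventions inside the script. A minimal sketch of the argument side, assuming the standard `--job-bookmark-option` values; the helper name is hypothetical, and the script-side requirements are listed as comments because they only run inside the Glue runtime.

```python
# Values accepted by --job-bookmark-option, to the best of my knowledge.
BOOKMARK_OPTIONS = (
    "job-bookmark-enable",
    "job-bookmark-disable",
    "job-bookmark-pause",
)


def bookmark_arguments(option="job-bookmark-enable"):
    """Return the DefaultArguments entry that sets bookmark behaviour."""
    if option not in BOOKMARK_OPTIONS:
        raise ValueError(f"unknown bookmark option: {option}")
    return {"--job-bookmark-option": option}


# Inside the Glue script, the argument alone is not enough. Bookmarks
# only take effect when:
#   - job.init(args["JOB_NAME"], args) runs at the start,
#   - every source read passes a stable transformation_ctx string,
#   - job.commit() runs at the end so the new state is saved.
print(bookmark_arguments())
```

A job that enables the argument but omits `transformation_ctx` or `job.commit()` is one common way to end up reprocessing the full dataset on every run without noticing.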
Python Shell jobs: the underused option
Many Glue jobs are written as Spark ETL when they should be Python Shell. Python Shell runs a single Python process at 1/16 or 1 DPU, which makes it materially cheaper for:
- Light data manipulation under a few GB.
- API integration and orchestration scripts.
- Reporting jobs that issue queries to Athena or Redshift.
- Glue Catalog metadata management.
A Spark job that processes 200 MB of data is paying for the Spark cluster, not the work. Convert it to Python Shell and the bill drops by 80 to 95 percent.
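The size of that drop follows from the DPU ratio. A small sketch, using hypothetical inputs: a 2-DPU Spark job that finishes in 5 minutes against a 1/16-DPU Python Shell job that takes twice as long for the same work. The function names are illustrative.

```python
RATE_PER_DPU_HOUR = 0.44  # same rate for Spark and Python Shell


def job_cost(dpus, runtime_minutes, rate=RATE_PER_DPU_HOUR):
    """Cost of one run, with the one-minute billing minimum applied."""
    billed_minutes = max(runtime_minutes, 1)
    return dpus * billed_minutes / 60 * rate


def python_shell_saving(spark_dpus, spark_minutes,
                        shell_dpus, shell_minutes):
    """Fraction of the Spark cost saved by moving to Python Shell."""
    spark = job_cost(spark_dpus, spark_minutes)
    shell = job_cost(shell_dpus, shell_minutes)
    return 1 - shell / spark


# Hypothetical: Spark at 2 DPUs for 5 min vs Python Shell at 1/16 DPU for 10 min.
saving = python_shell_saving(2, 5, 0.0625, 10)
print(f"{saving:.2%}")  # 93.75%
```

Even when the single-process version runs slower, the DPU count dominates. At 1 DPU instead of 1/16 the saving is smaller, so the sizing choice is worth checking per job.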
Streaming Glue: cost trap
Streaming Glue jobs bill continuously while the stream is active. A 4-worker streaming job on G.1X costs $0.44 per DPU-hour times 4 DPUs = $1.76 per hour, or roughly $1,285 over a 730-hour month. Two patterns to avoid:
- Streaming for low-volume sources. If the upstream source only produces records once per hour, a scheduled batch Glue job is far cheaper than a streaming job.
- Over-provisioned streaming. A 10-worker streaming job for a low-throughput stream burns DPU-hours that the workload does not need.
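The streaming-versus-batch gap is easy to put numbers on. A minimal sketch, assuming a 730-hour month and a hypothetical hourly batch job that runs for 3 minutes on the same 4 DPUs; all names and the batch runtime are illustrative.

```python
RATE_PER_DPU_HOUR = 0.44
HOURS_PER_MONTH = 730  # common approximation for one month


def streaming_monthly_cost(dpus):
    """Always-on streaming job: billed for every hour of the month."""
    return dpus * HOURS_PER_MONTH * RATE_PER_DPU_HOUR


def batch_monthly_cost(dpus, minutes_per_run, runs_per_month):
    """Scheduled batch job: billed only while each run is executing."""
    return dpus * minutes_per_run / 60 * runs_per_month * RATE_PER_DPU_HOUR


print(f"streaming: ${streaming_monthly_cost(4):.2f}")           # streaming: $1284.80
print(f"hourly batch: ${batch_monthly_cost(4, 3, 730):.2f}")    # hourly batch: $64.24
```

Under these assumptions the batch job costs about 5 percent of the streaming job, because it is billed for 3 minutes of every hour instead of 60.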
Crawler discipline
Glue crawlers are billed per DPU-hour with a ten-minute minimum per crawl. The patterns that pad the bill:
- Crawling buckets daily when the data only changes weekly.
- Crawling the entire bucket when only one new partition has appeared.
- Running crawlers as a substitute for partition projection on Athena tables.
For Athena workloads, partition projection (deterministic partition naming) eliminates the need for catalog crawls entirely.
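Partition projection is configured through table properties rather than crawls. The sketch below builds the properties for a single date-typed partition column; the property keys are the Athena ones as I understand them, while the helper names, the column name `dt`, and the bucket path are hypothetical.

```python
def date_projection_properties(column, start_date, location_template,
                               date_format="yyyy-MM-dd"):
    """Table properties for projecting one date partition column."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.range": f"{start_date},NOW",
        f"projection.{column}.format": date_format,
        "storage.location.template": location_template,
    }


def to_tblproperties(props):
    """Render the properties as a TBLPROPERTIES clause for DDL."""
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n  {pairs}\n)"


props = date_projection_properties(
    column="dt",
    start_date="2023-01-01",
    location_template="s3://example-bucket/events/dt=${dt}/",  # hypothetical path
)
print(to_tblproperties(props))
```

With these properties in place, Athena computes partition locations from the template at query time, so new daily prefixes need no crawl and no catalog update.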
Worked example: $24K monthly Glue bill
| Step | Action | Bill after |
|---|---|---|
| Baseline | G.2X across all jobs, no autoscaling, no bookmarks | $24,000/month |
| Step 1 | Move appropriate jobs to G.1X | ~$14,000/month |
| Step 2 | Enable autoscaling | ~$9,000/month |
| Step 3 | Enable bookmarks on incremental jobs | ~$4,500/month |
| Step 4 | Convert light Spark jobs to Python Shell | ~$3,200/month |
The example goes from $24,000 to $3,200 per month, a reduction of roughly 87 percent. Reductions of around 85 percent are typical on a mature Glue estate. Each step is reversible and low-risk if done in order.
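The per-step and cumulative savings in the table can be checked with a few lines of arithmetic. This sketch uses only the figures from the table; the step labels and function name are illustrative.

```python
# Monthly bill after each step, taken from the worked example.
steps = [
    ("Baseline", 24000),
    ("Move to G.1X", 14000),
    ("Enable autoscaling", 9000),
    ("Enable bookmarks", 4500),
    ("Convert to Python Shell", 3200),
]


def step_savings(steps):
    """Return (name, step saving %, cumulative saving %) for each step."""
    baseline = steps[0][1]
    results = []
    for (_, before), (name, after) in zip(steps, steps[1:]):
        step_pct = (1 - after / before) * 100
        cumulative_pct = (1 - after / baseline) * 100
        results.append((name, step_pct, cumulative_pct))
    return results


for name, step_pct, cumulative_pct in step_savings(steps):
    print(f"{name}: step {step_pct:.1f}%, cumulative {cumulative_pct:.1f}%")
```

The bookmark step removes half of what remains at that point, which is why it is worth doing even after the worker-type and autoscaling changes have already cut the bill.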
The EDP angle
Glue is part of the analytics bundle inside an Enterprise Discount Program (EDP) commitment. The negotiation levers:
- Bundle Glue DPU-hour with Athena and EMR for a blended analytics discount.
- Negotiate free Glue Data Catalog requests; this line is rarely large but trivial for AWS to give.
- Secure DPU-hour rate discounts at 50,000+ DPU-hours per month.
- Negotiate streaming Glue at a reduced rate for committed throughput.
Glue 4.0 and Glue 5.0 features
Newer Glue versions include performance improvements that translate directly into cost reductions:
- Adaptive query execution reduces shuffle overhead.
- Iceberg, Hudi, and Delta Lake support reduces the cost of partition and schema evolution.
- Native Spark optimisations reduce DPU-hours for the same workload by roughly 15 to 25 percent versus Glue 3.0.
Upgrading jobs to the newest stable Glue version pays for itself within the first month of operation.
Common failure modes
Over-provisioned worker counts
The most common pattern is jobs configured with 10 to 20 workers when 4 to 6 would suffice. The Spark UI shows how many tasks actually ran in parallel in each stage; size the worker count to that peak, not to the default.
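One way to turn the peak parallelism reading into a worker count is sketched below. It assumes 4 task slots per G.1X worker (matching its 4 vCPUs) and one extra worker for the driver; both are assumptions to verify against the job's actual `spark.executor.cores` setting, and the function name is hypothetical.

```python
import math


def workers_for_parallelism(peak_concurrent_tasks, cores_per_worker=4,
                            driver_workers=1):
    """Estimate the worker count needed to run the observed peak in parallel.

    cores_per_worker and driver_workers are assumptions, not Glue constants.
    """
    executors = math.ceil(peak_concurrent_tasks / cores_per_worker)
    return max(executors, 1) + driver_workers


# A job whose busiest stage ran 20 tasks at once.
print(workers_for_parallelism(20))  # 6
```

With autoscaling enabled, the result is a reasonable starting value for the maximum worker setting rather than a fixed allocation.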
Manual restarts
Jobs that fail partway through and are manually restarted from scratch reprocess work that already succeeded. Use bookmarks or job-state persistence so retries are incremental.
Long-running development jobs
Interactive development sessions (Glue notebooks) keep billing DPU-hours while idle. Set an idle timeout on every session and stop notebooks when finished.
Implementation checklist
- Inventory Glue jobs by DPU-hours consumed over the past 30 days.
- Right-size worker types for top-cost jobs.
- Enable autoscaling on all eligible jobs.
- Add bookmarks to incremental jobs.
- Convert lightweight Spark jobs to Python Shell.
- Negotiate the analytics bundle inside the next EDP cycle.
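The inventory step at the top of the checklist can be sketched as a small aggregation. It assumes run records shaped like the `JobRuns` entries returned by boto3's `get_job_runs` (fields `JobName`, `ExecutionTime` in seconds, `MaxCapacity` in DPUs, and `DPUSeconds` where Glue reports it); fetching and date filtering are left out, and the sample records are made up.

```python
from collections import defaultdict


def run_dpu_hours(run):
    """DPU-hours for one run, preferring the metered DPUSeconds value."""
    if "DPUSeconds" in run:
        return run["DPUSeconds"] / 3600
    # Fallback: allocated capacity times execution time.
    return run.get("ExecutionTime", 0) * run.get("MaxCapacity", 0) / 3600


def rank_jobs_by_dpu_hours(runs):
    """Total DPU-hours per job, highest first."""
    totals = defaultdict(float)
    for run in runs:
        totals[run["JobName"]] += run_dpu_hours(run)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample records.
sample_runs = [
    {"JobName": "small_report", "DPUSeconds": 7200},
    {"JobName": "nightly_load", "ExecutionTime": 3600, "MaxCapacity": 10},
    {"JobName": "nightly_load", "ExecutionTime": 1800, "MaxCapacity": 10},
]
for name, hours in rank_jobs_by_dpu_hours(sample_runs):
    print(f"{name}: {hours:.1f} DPU-hours")
```

The top few entries in this ranking usually account for most of the bill, so right-sizing effort belongs there first.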
Glue Catalog cost dynamics
The Glue Data Catalog itself is rarely a top cost line, but the patterns that inflate it are worth catching early. Catalog charges scale with object count and request volume. The common growth driver is per-prefix table registration: every new S3 prefix becomes a new table, the count balloons, and request volume grows proportionally. The fix is consolidation: register one partitioned table per dataset and use partition projection where possible to avoid catalog lookups entirely.
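To see how object count and request volume translate into dollars, a rough estimator is sketched below. It uses the rates quoted earlier in this piece and assumes a free first million objects and a free first million requests per month; the free-tier defaults are an assumption to verify against current pricing, and the function name is hypothetical.

```python
def catalog_monthly_cost(objects, requests,
                         free_objects=1_000_000, free_requests=1_000_000,
                         price_per_100k_objects=1.0,
                         price_per_million_requests=1.0):
    """Estimate the monthly Data Catalog charge in dollars."""
    billable_objects = max(objects - free_objects, 0)
    billable_requests = max(requests - free_requests, 0)
    storage = billable_objects / 100_000 * price_per_100k_objects
    access = billable_requests / 1_000_000 * price_per_million_requests
    return storage + access


# Hypothetical estate: 3 million catalog objects, 25 million requests a month.
print(f"${catalog_monthly_cost(3_000_000, 25_000_000):.2f}")  # $44.00
```

The figure is small next to the ETL bill, which matches the point above: the catalog matters mainly as an early signal that table registration is sprawling.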
DataBrew vs Glue Studio
For low-code data preparation, Glue Studio (visual job authoring on top of standard Glue ETL) is usually a cheaper landing zone than Glue DataBrew. DataBrew bills per node-hour at a separate rate; Glue Studio uses standard DPU-hour billing on the underlying job. For ad-hoc data preparation by analyst teams, DataBrew makes sense. For scheduled production transformation, Glue Studio jobs are materially cheaper.
For more, see the AWS analytics cost optimization pillar, the Athena query cost reduction piece for the downstream query layer, and the EMR cluster cost strategy piece for heavyweight ETL alternatives where Glue is uneconomical.