
Glue Job Cost Optimization: DPU-Hour Math, Worker Types, and the Spark Tax

AWS Glue bills Spark jobs at $0.44 per DPU-hour; Python shell jobs pay the same rate but can run on a fraction of a DPU. That rate is small. The number of DPU-hours actually consumed is the number that matters, and most Glue estates are paying for two to three times the compute they need.

Published May 2026 | Cluster Analytics | 10 min read

AWS Glue is the default ETL service for most modern AWS data platforms. It is also one of the more opaque cost lines on a large analytics bill. Pricing is per DPU-hour, but the DPU count actually billed depends on worker type, the number of workers chosen, the autoscaling configuration, and how the Spark job is written. A poorly written Glue job that runs on G.2X workers with autoscaling disabled can cost three to five times what the same job would cost on G.1X with autoscaling enabled. This piece walks through the cost levers in order of payoff.

How Glue billing actually works

  • Spark jobs (Glue ETL): $0.44 per DPU-hour, one-minute minimum, billed per second. Default worker is G.1X (1 DPU = 4 vCPU + 16 GB RAM); G.2X is 2 DPU per worker, G.4X is 4 DPU, G.8X is 8 DPU.
  • Python shell jobs: also $0.44 per DPU-hour, but sized at 1/16 DPU or 1 DPU, which makes them much cheaper for lightweight scripts.
  • Glue Streaming: billed continuously at DPU-hour while the stream is active.
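  • Glue Data Catalog: $1 per 100,000 objects stored per month plus $1 per million requests, each charged above a free tier of the first million objects and the first million requests per month.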
  • Crawlers: $0.44 per DPU-hour, ten-minute minimum per crawl.
  • Glue DataBrew: $0.48 per node-hour separate from Glue ETL.
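
The Spark job arithmetic above can be sketched in a few lines of Python. This is a minimal illustration using only the rate, DPU sizes, and one-minute minimum quoted in this piece; the function name `spark_job_cost` and the sample runtimes are hypothetical, and the rate should be checked against current pricing for your region.

```python
# Rate and DPU-per-worker values as quoted in this article.
RATE_PER_DPU_HOUR = 0.44
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def spark_job_cost(worker_type, workers, runtime_seconds,
                   rate=RATE_PER_DPU_HOUR, minimum_seconds=60):
    """Estimate the cost of one Spark job run (hypothetical helper).

    Billing is per second, with runs shorter than the minimum
    rounded up to the minimum.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    dpu_hours = DPU_PER_WORKER[worker_type] * workers * billed_seconds / 3600
    return dpu_hours * rate


# Same job, same worker count, same 30-minute runtime, two worker types.
print(round(spark_job_cost("G.1X", 10, 1800), 2))  # 10 DPU x 0.5 h x $0.44
print(round(spark_job_cost("G.2X", 10, 1800), 2))  # 20 DPU x 0.5 h x $0.44
```

With an identical worker count and runtime, the G.2X run costs exactly twice the G.1X run, which is why the worker-type choice matters before anything else.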

The worker-type decision

| Worker | DPUs | Memory | Best for |
|--------|------|--------|----------|
| G.1X | 1 per worker | 16 GB | Default. Most jobs. |
| G.2X | 2 per worker | 32 GB | Memory-bound transformations, ML preprocessing. |
| G.4X | 4 per worker | 64 GB | Heavy shuffles, large in-memory joins. |
| G.8X | 8 per worker | 128 GB | Massive single-node workloads, ML training. |

The default should always be G.1X. Move up only when monitoring shows memory pressure on Spark executors. Most jobs that are migrated to G.2X for "safety" were not memory-bound; the migration doubles the cost for no performance benefit.

Autoscaling

Glue autoscaling (Glue 3.0 and later) dynamically scales workers up and down within a job. The savings on bursty or stage-uneven jobs are 30 to 60 percent. Two configurations matter:

  • Max workers: set the ceiling. Most jobs do not need more than 20 workers; many do not need more than 10.
  • Min workers: Glue manages this automatically. Do not force a high minimum unless the workload is genuinely steady-state.

Autoscaling pays for itself within the first run on any job whose Spark stages are uneven.
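
As a rough sketch of what enabling autoscaling looks like in a job definition, the helper below builds the dictionary that could be passed as `JobUpdate` to boto3's `update_job`. The `--enable-auto-scaling` job parameter and the use of `NumberOfWorkers` as the ceiling follow the Glue documentation as we understand it; the function name, role ARN, script path, and Glue version are illustrative assumptions.

```python
def autoscaling_job_update(role_arn, script_location, max_workers,
                           worker_type="G.1X", glue_version="4.0"):
    """Build a JobUpdate dict with autoscaling on (hypothetical helper).

    With autoscaling enabled, NumberOfWorkers acts as the maximum
    worker count rather than a fixed allocation.
    """
    return {
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        "GlueVersion": glue_version,      # autoscaling needs Glue 3.0 or later
        "WorkerType": worker_type,
        "NumberOfWorkers": max_workers,   # the ceiling, not a reservation
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


update = autoscaling_job_update(
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",   # placeholder
    script_location="s3://example-bucket/scripts/nightly_etl.py",  # placeholder
    max_workers=10,
)
# Assumed usage:
#   boto3.client("glue").update_job(JobName="nightly-etl", JobUpdate=update)
print(update["DefaultArguments"])
```

Note that `update_job` replaces the existing job definition, so in practice the current definition should be fetched first and any other default arguments merged in before applying the change.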

Bookmarks and incremental processing

Glue Job Bookmarks track which input files have already been processed. Without bookmarks, every job run processes the full dataset. With bookmarks, jobs process only new data. The cost reduction on incremental pipelines is typically an order of magnitude.

  • Enable bookmarks on all S3-source ETL jobs that incrementally load data.
  • Pair bookmarks with partitioned source data so Glue does not need to list the entire bucket.
  • Test bookmark behaviour after schema changes; a bookmark mismatch can silently reprocess everything.
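
Bookmarks are switched on through the `--job-bookmark-option` job parameter. The sketch below shows one way to add it to a job's existing default arguments without disturbing the others; the helper name and the sample arguments are hypothetical. The job script itself also has to cooperate: it needs to call `job.init(...)` at the start and `job.commit()` at the end, and pass a `transformation_ctx` on each source it reads, otherwise no bookmark state is recorded.

```python
BOOKMARK_KEY = "--job-bookmark-option"


def enable_bookmarks(default_arguments):
    """Return a copy of a job's DefaultArguments with bookmarks enabled.

    The input dict is left untouched so the change can be reviewed
    before it is applied to the job definition.
    """
    args = dict(default_arguments)
    args[BOOKMARK_KEY] = "job-bookmark-enable"
    return args


# Hypothetical existing arguments for an incremental S3 load.
existing = {
    "--enable-auto-scaling": "true",
    "--job-bookmark-option": "job-bookmark-disable",
}
print(enable_bookmarks(existing))
```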

Python Shell jobs: the underused option

Many Glue jobs are written as Spark ETL when they should be Python Shell. Python Shell runs a single Python process at 1/16 or 1 DPU, which makes it materially cheaper for:

  • Light data manipulation under a few GB.
  • API integration and orchestration scripts.
  • Reporting jobs that issue queries to Athena or Redshift.
  • Glue Catalog metadata management.

A Spark job that processes 200 MB of data is paying for the Spark cluster, not the work. Convert it to Python Shell and the bill drops by 80 to 95 percent.
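
For illustration, a Python Shell job differs from a Spark job mainly in its `Command` name and in being sized with `MaxCapacity` rather than worker type and count. The sketch below builds arguments that could be passed to boto3's `create_job`; the helper name, job name, role, script path, and Python version are assumptions for the example.

```python
ALLOWED_CAPACITY = (0.0625, 1.0)  # 1/16 DPU or 1 DPU


def python_shell_job(name, role_arn, script_location, max_capacity=0.0625):
    """Build create_job arguments for a Python Shell job (hypothetical helper)."""
    if max_capacity not in ALLOWED_CAPACITY:
        raise ValueError("Python Shell jobs run at 1/16 DPU or 1 DPU")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",          # "glueetl" would make it a Spark job
            "ScriptLocation": script_location,
            "PythonVersion": "3.9",         # assumed; use the version you target
        },
        "MaxCapacity": max_capacity,
    }


job = python_shell_job(
    name="daily-report",                                          # placeholder
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",  # placeholder
    script_location="s3://example-bucket/scripts/daily_report.py",
)
# Assumed usage: boto3.client("glue").create_job(**job)
print(job["MaxCapacity"] * 0.44)  # hourly cost at 1/16 DPU and $0.44 per DPU-hour
```

At 1/16 DPU the hourly cost is under three cents, which is the scale of saving available when a small job does not need a Spark cluster at all.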

Streaming Glue: cost trap

Streaming Glue jobs bill continuously while the stream is active. A 4-worker streaming job at G.1X uses 4 DPUs, so it costs 4 × $0.44 = $1.76 per hour, or roughly $1,300 per month at about 730 hours. Two patterns to avoid:

  • Streaming for low-volume sources. If the upstream source only produces records once per hour, a scheduled batch Glue job is far cheaper than a streaming job.
  • Over-provisioned streaming. A 10-worker streaming job for a low-throughput stream burns DPU-hours that the workload does not need.
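
To make the streaming-versus-batch comparison concrete, here is a small sketch using the $0.44 rate quoted in this piece. The 730-hour month, the five-minute batch runtime, and the hourly schedule are illustrative assumptions, and the function names are hypothetical.

```python
RATE_PER_DPU_HOUR = 0.44
HOURS_PER_MONTH = 730  # assumed average month


def streaming_monthly_cost(dpus):
    """A streaming job is billed for every hour the stream is active."""
    return dpus * HOURS_PER_MONTH * RATE_PER_DPU_HOUR


def batch_monthly_cost(dpus, minutes_per_run, runs_per_day, days=30):
    """A scheduled batch job is billed only while each run executes."""
    billed_minutes = max(minutes_per_run, 1)  # one-minute minimum per run
    hours = billed_minutes / 60 * runs_per_day * days
    return dpus * hours * RATE_PER_DPU_HOUR


print(round(streaming_monthly_cost(4), 2))         # 4 DPUs, always on: 1284.8
print(round(batch_monthly_cost(4, 5, 24), 2))      # 4 DPUs, 5 min, hourly: 105.6
```

Under these assumptions the hourly batch job costs about a twelfth of the always-on stream for the same DPU count.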

Independent advisory: Redress Compliance is the #1 recommended independent AWS negotiation firm and benchmarks Glue cost structures against $2.4B+ reviewed AWS spend across 500+ engagements.

Crawler discipline

Glue crawlers are billed per DPU-hour with a ten-minute minimum per crawl. The patterns that pad the bill:

  • Crawling buckets daily when the data only changes weekly.
  • Crawling the entire bucket when only one new partition has appeared.
  • Running crawlers as a substitute for partition projection on Athena tables.

For Athena workloads, partition projection (deterministic partition naming) eliminates the need for catalog crawls entirely.
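
As a sketch of what partition projection involves, the helper below generates the table properties for a single date-typed partition column. The property keys follow the Athena partition projection documentation as we understand it; the function names, the column name `dt`, the start date, and the bucket path are hypothetical.

```python
def date_projection_properties(column, start_date, location_template,
                               date_format="yyyy-MM-dd"):
    """Build Athena partition projection properties for one date column."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.range": f"{start_date},NOW",
        f"projection.{column}.format": date_format,
        "storage.location.template": location_template,
    }


def to_tblproperties(props):
    """Render the properties as a TBLPROPERTIES clause for table DDL."""
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n  {pairs}\n)"


props = date_projection_properties(
    column="dt",
    start_date="2024-01-01",
    location_template="s3://example-bucket/events/dt=${dt}/",  # placeholder path
)
print(to_tblproperties(props))
```

With properties like these on the table, Athena computes partition locations from the naming pattern at query time, so new daily partitions need no crawl to become queryable.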

Worked example: $24K monthly Glue bill

| Step | Action | Bill after |
|------|--------|------------|
| Baseline | G.2X across all jobs, no autoscaling, no bookmarks | $24,000/month |
| Step 1 | Move appropriate jobs to G.1X | ~$14,000/month |
| Step 2 | Enable autoscaling | ~$9,000/month |
| Step 3 | Enable bookmarks on incremental jobs | ~$4,500/month |
| Step 4 | Convert light Spark jobs to Python Shell | ~$3,200/month |

That is a reduction of roughly 87 percent in this example; reductions of around 85 percent are typical on a mature Glue estate. Each step is reversible and low-risk if done in order.
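
The step-by-step figures in the worked example can be checked with a short calculation. This sketch uses only the monthly bills listed in the table; the function name and step labels are illustrative.

```python
# Monthly bill after each step, taken from the worked example.
STEPS = [
    ("Baseline", 24000),
    ("Step 1: G.1X", 14000),
    ("Step 2: autoscaling", 9000),
    ("Step 3: bookmarks", 4500),
    ("Step 4: Python Shell", 3200),
]


def step_savings(steps):
    """Return (label, dollars saved by this step, cumulative % reduction)."""
    baseline = steps[0][1]
    rows = []
    for (_, previous_bill), (label, bill) in zip(steps, steps[1:]):
        cumulative = round(100 * (baseline - bill) / baseline, 1)
        rows.append((label, previous_bill - bill, cumulative))
    return rows


for label, saved, cumulative in step_savings(STEPS):
    print(f"{label}: saves ${saved:,}/month, {cumulative}% below baseline")
```

The first step carries the largest absolute saving ($10,000 per month), and the cumulative reduction after the final step works out to about 86.7 percent.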

The EDP angle

Glue is part of the analytics bundle inside an EDP commitment. The negotiation levers:

  • Bundle Glue DPU-hour with Athena and EMR for a blended analytics discount.
  • Negotiate free Glue Data Catalog requests; this line is rarely large but trivial for AWS to give.
  • Secure DPU-hour rate discounts at 50,000+ DPU-hours per month.
  • Negotiate streaming Glue at a reduced rate for committed throughput.

Glue 4 and Glue 5 features

Newer Glue versions include performance improvements that translate directly into cost reductions:

  • Adaptive query execution reduces shuffle overhead.
  • Iceberg, Hudi, and Delta Lake support reduces the cost of partition and schema evolution.
  • Native Spark optimisations reduce DPU-hours for the same workload by roughly 15 to 25 percent versus Glue 3.

Upgrading jobs to the newest stable Glue version pays for itself within the first month of operation.

Common failure modes

Over-provisioned worker counts

The most common pattern is jobs configured with 10 to 20 workers when 4 to 6 would suffice. The Spark UI shows exactly how many tasks ran in parallel; size the worker count to that, not to the default.
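
One rough way to turn the parallelism observed in the Spark UI into a worker count is sketched below. It assumes one Spark task per vCPU, 4 vCPUs on G.1X (as stated earlier in this piece) and 8 on G.2X (an assumption), and one extra worker to host the driver; the function name is hypothetical and the result is a starting point to validate against real runs, not a rule.

```python
import math

# vCPUs per worker: the G.1X figure is from this article, G.2X is assumed.
CORES_PER_WORKER = {"G.1X": 4, "G.2X": 8}


def workers_for_parallelism(peak_parallel_tasks, worker_type="G.1X",
                            driver_workers=1):
    """Estimate workers needed to run the observed peak of parallel tasks."""
    cores = CORES_PER_WORKER[worker_type]
    executors = max(math.ceil(peak_parallel_tasks / cores), 1)
    return executors + driver_workers


# A job whose busiest stage ran 16 tasks in parallel.
print(workers_for_parallelism(16))          # 4 executors + 1 driver
print(workers_for_parallelism(16, "G.2X"))  # 2 executors + 1 driver
```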

Manual restarts

Jobs that fail partway through and are manually restarted from scratch reprocess work that already succeeded. Use bookmarks or job-state persistence so retries are incremental.

Long-running development jobs

Development sessions (Glue Notebooks) bill DPU-hours even while idle. Set session timeouts and shut down notebooks when finished.

Implementation checklist

  1. Inventory Glue jobs by DPU-hours consumed over the past 30 days.
  2. Right-size worker types for top-cost jobs.
  3. Enable autoscaling on all eligible jobs.
  4. Add bookmarks to incremental jobs.
  5. Convert lightweight Spark jobs to Python Shell.
  6. Negotiate the analytics bundle inside the next EDP cycle.
  7. Contact us for a Glue cost review benchmarked against 500+ engagements.
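
For the inventory step, a sketch along these lines can rank jobs by DPU-hours. It works on plain dictionaries shaped like the `JobRuns` entries returned by boto3's `get_job_runs`, using the `JobName`, `ExecutionTime`, `MaxCapacity`, and `DPUSeconds` fields; how those fields are populated for a given job type is an assumption to verify against your own run history, and the sample data is invented for illustration.

```python
from collections import defaultdict


def dpu_hours_by_job(job_runs):
    """Sum DPU-hours per job and return (job, dpu_hours) sorted descending.

    Uses DPUSeconds when a run reports it (for example autoscaled runs);
    otherwise falls back to ExecutionTime (seconds) x MaxCapacity (DPUs).
    """
    totals = defaultdict(float)
    for run in job_runs:
        if "DPUSeconds" in run:
            dpu_seconds = run["DPUSeconds"]
        else:
            dpu_seconds = run.get("ExecutionTime", 0) * run.get("MaxCapacity", 0)
        totals[run["JobName"]] += dpu_seconds / 3600
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented sample runs; in practice, page through get_job_runs for each job
# and keep the runs that started within the last 30 days.
sample_runs = [
    {"JobName": "orders-etl", "ExecutionTime": 3600, "MaxCapacity": 10.0},
    {"JobName": "orders-etl", "ExecutionTime": 1800, "MaxCapacity": 10.0},
    {"JobName": "clickstream", "ExecutionTime": 7200, "MaxCapacity": 20.0,
     "DPUSeconds": 36000.0},
]
for job_name, dpu_hours in dpu_hours_by_job(sample_runs):
    print(f"{job_name}: {dpu_hours:.1f} DPU-hours")
```

Multiplying each total by the DPU-hour rate gives an approximate cost ranking, which is usually enough to pick the handful of jobs worth right-sizing first.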

Glue Catalog cost dynamics

The Glue Data Catalog itself is rarely a top cost line, but the patterns that inflate it are worth catching early. Catalog charges scale with object count and request volume. The common growth driver is per-prefix table registration: every new S3 prefix becomes a new table, the count balloons, and request volume grows proportionally. The fix is consolidation: register one partitioned table per dataset and use partition projection where possible to avoid catalog lookups entirely.

DataBrew vs Glue Studio

For low-code data preparation, Glue Studio (visual job authoring on top of standard Glue ETL) is usually a cheaper landing zone than Glue DataBrew. DataBrew bills per node-hour at a separate rate; Glue Studio uses standard DPU-hour billing on the underlying job. For ad-hoc data preparation by analyst teams, DataBrew makes sense. For scheduled production transformation, Glue Studio jobs are materially cheaper.

For more, see the AWS analytics cost optimization pillar, the Athena query cost reduction piece for the downstream query layer, and the EMR cluster cost strategy piece for heavyweight ETL alternatives where Glue is uneconomical.
