If you want to reduce your AWS bill and you do not have a dedicated FinOps team, you almost certainly do not need one. Most small-team AWS bills I have audited are running 30-50% waste, and the cause is never exotic. It is an EC2 instance someone sized for a launch that never happened, gp2 volumes that should have been gp3 two years ago, no Savings Plan on a workload that has not changed since 2023, and a NAT Gateway quietly charging per gigabyte for traffic that never needed to leave the VPC. The fix is boring, which is exactly why it works: turn on the visibility tools, find the idle and over-provisioned resources, and clean them up in priority order. I have done this on accounts from $400/month to $40k/month and the first pass routinely cuts 30% or more.
Where is the money actually going?
You cannot cut what you cannot see, and the AWS console homepage does not show you. Before changing a single resource, turn on Cost Explorer and create a budget with an alert. Cost Explorer has a one-time enable step (once enabled, it backfills up to 12 months of history), and AWS Budgets emails you the moment spend crosses a threshold instead of leaving you to find a surprise on the invoice. Set the alert at 80% of your expected monthly spend so it fires while you can still act.
# Email alert at 80% of a $1000/month budget.
# Run once. The budget amount and your email are the only things to change;
# ACCOUNT_ID is looked up for you.
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
cat > budget.json <<'EOF'
{
  "BudgetName": "monthly-all-up",
  "BudgetLimit": { "Amount": "1000", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF
cat > notifications.json <<'EOF'
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "you@example.com" }
    ]
  }
]
EOF
aws budgets create-budget \
  --account-id "$ACCOUNT_ID" \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json

The second half of visibility is tagging. If every resource carries a Project tag, Cost Explorer can group spend by project and you can finally answer "what does the staging environment cost us?" without guessing. Activate the tag as a cost allocation tag in the Billing console once, then enforce it on new resources. Tags applied today do not retroactively label last month's spend, so the sooner you start the sooner the reports become useful.
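Once the tag is active, the same grouping is available from the CLI. This is a minimal sketch, assuming the tag key is exactly Project and has already been activated for cost allocation; cost_by_project is a name I made up for illustration, not an AWS command.

```shell
# Sketch: unblended cost per month, grouped by the Project tag.
# Assumption: the tag key is "Project" and is activated as a cost allocation tag.
# Usage: cost_by_project START END   (dates as YYYY-MM-DD; END is exclusive)
cost_by_project() {
  local start="$1" end="$2"
  aws ce get-cost-and-usage \
    --time-period "Start=${start},End=${end}" \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --group-by Type=TAG,Key=Project \
    --query 'ResultsByTime[].Groups[].[Keys[0], Metrics.UnblendedCost.Amount]' \
    --output text
}
```

Call it as cost_by_project 2025-01-01 2025-02-01. Each row should come back as the tag key and value joined by a dollar sign, followed by the amount; a row with nothing after the dollar sign is spend that carries no Project tag, which is your to-do list.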
How do I right-size over-provisioned EC2 and RDS?
Compute is the single biggest line item on most bills and the most over-provisioned. The instinct is to size for peak-of-peak and never look again, but CloudWatch tells you the truth. Pull the last two weeks of CPU and look at the p95, not the average. If a t3.xlarge sits at 8% CPU with peaks under 25%, it is doing the work of a t3.medium and costing four times as much.
# Max CPU per hour over 14 days for one instance.
# If the max barely moves, the instance is oversized.
# Note: `date -d` is GNU date (Linux); on macOS/BSD use `date -u -v-14d` for the start time.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 \
  --statistics Maximum \
  --query 'sort_by(Datapoints,&Timestamp)[].Maximum'

AWS Compute Optimizer does this analysis for you across the whole account and recommends a target instance type with the projected saving; turn it on and read its EC2 and RDS findings before you start guessing. The gotcha: do not downsize blind. Memory is invisible to the default EC2 CloudWatch metrics, so a box that looks idle on CPU may be memory-bound and will start swapping the moment you shrink it. Test the new size in staging under realistic load first, and change one instance at a time so you can correlate any regression. For RDS the same rule applies, plus one extra: changing the DB instance class triggers a reboot, so schedule it in a maintenance window.
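If you want an actual p95 rather than eyeballing a list of hourly maxima, a small helper is enough. This is a sketch that uses the nearest-rank definition of a percentile (other definitions interpolate and give slightly different numbers); percentile is a hypothetical helper name, and it expects an integer percentile and one number per line on stdin.

```shell
# Nearest-rank percentile of newline-separated numbers read from stdin.
# Usage: <numbers> | percentile 95
percentile() {
  LC_ALL=C sort -n | LC_ALL=C awk -v p="$1" '
    { v[NR] = $1 }                        # values arrive sorted ascending
    END {
      if (NR == 0) exit 1                 # no datapoints, nothing to report
      rank = int((p * NR + 99) / 100)     # ceil(p/100 * NR) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

To feed it from CloudWatch, change the query in the command above to 'Datapoints[].Maximum', add --output text, and pipe the result through tr '\t' '\n' | percentile 95. A p95 that sits far below the instance's capacity is the signal to look at a smaller size.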
Why move EBS gp2 to gp3?
gp3 is the rare optimization that is both cheaper and faster, which is why it is the first change I make on any account. gp2 ties IOPS to volume size, so the only way to get more throughput was to over-allocate storage you did not need. gp3 decouples them: every volume ships with a 3,000 IOPS / 125 MB/s baseline included in the storage price, which is roughly 20% cheaper per GB than gp2. The migration is a live ModifyVolume with no downtime and no detach.
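# Find every gp2 volume, then convert it in place. No detach, no reboot.
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[].VolumeId' --output text \
  | tr '\t' '\n' \
  | while read -r vol; do
      [ -n "$vol" ] || continue   # skip the empty line when there are no gp2 volumes
      echo "Converting $vol -> gp3"
      aws ec2 modify-volume --volume-id "$vol" --volume-type gp3
    done

Run that during business hours if you like; the volume stays attached and serving I/O while it transitions through the optimizing state. The only volumes to leave alone are ones already on io1/io2 because they genuinely need provisioned IOPS above what gp3 offers. One caveat: gp2 baseline IOPS scale at 3 per GiB, so a gp2 volume larger than 1,000 GiB already gets more than gp3's default 3,000. For those, pass --iops (and --throughput if the workload needs it) to modify-volume instead of accepting the defaults.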
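Before converting a large volume it is worth checking what gp2 was actually giving you. The sketch below applies the gp2 baseline rule (3 IOPS per GiB, floor of 100, cap of 16,000) and compares it with gp3's included 3,000 IOPS. The prices are assumptions (the us-east-1 list prices of $0.10 and $0.08 per GiB-month as I last saw them, overridable through GP2_PRICE and GP3_PRICE), gp2_to_gp3_check is a hypothetical helper, and it ignores throughput and the cost of any extra provisioned gp3 IOPS.

```shell
# Sketch: compare a gp2 volume's baseline IOPS with the gp3 default and
# estimate the monthly storage saving.
# Assumptions: per-GiB-month prices default to 0.10 (gp2) and 0.08 (gp3).
# Usage: gp2_to_gp3_check SIZE_GIB
gp2_to_gp3_check() {
  awk -v size="$1" -v gp2="${GP2_PRICE:-0.10}" -v gp3="${GP3_PRICE:-0.08}" 'BEGIN {
    iops = 3 * size                      # gp2 baseline: 3 IOPS per GiB
    if (iops < 100) iops = 100           # gp2 floor
    if (iops > 16000) iops = 16000       # gp2 cap
    verdict = (iops > 3000) ? "set --iops above the gp3 default" : "gp3 default IOPS suffice"
    printf "%d GiB: gp2 baseline %d IOPS, %s, storage saving $%.2f/month\n",
           size, iops, verdict, size * (gp2 - gp3)
  }'
}
```

For example, gp2_to_gp3_check 500 reports a 1,500 IOPS baseline that the gp3 default covers, while gp2_to_gp3_check 2000 reports 6,000 and tells you to provision IOPS explicitly. Check the current pricing page for your region before trusting the dollar figure.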
Should I commit with Savings Plans or Reserved Instances?
After you have right-sized so you are not committing to waste, cover your steady baseline with a commitment. On-Demand is the most expensive way to run anything that runs 24/7. For most teams a Compute Savings Plan is the right tool: you commit to a dollar-per-hour spend for one or three years and AWS discounts up to roughly 66%, and unlike a standard Reserved Instance the discount follows you across instance family, size, region, and even Fargate and Lambda. That flexibility matters when you are still changing your architecture.
- Compute Savings Plan: most flexible, covers EC2, Fargate, and Lambda across families and regions. The default choice for a small team that may still re-architect.
- EC2 Instance Savings Plan: a deeper discount but locks you to an instance family in one region. Only worth it for a workload you are certain will not move.
- Reserved Instances: still the path for RDS, ElastiCache, and OpenSearch, which Compute Savings Plans do not cover. Buy these for your steady database baseline.
- Commit to the trough, not the peak: cover only the capacity you run around the clock, and let On-Demand or Spot absorb the spikes.
Start with a one-year, no-upfront commitment sized to your minimum baseline. The Cost Explorer recommendations page reads your actual usage and proposes a commitment amount; treat it as a starting point and round down, because an over-committed plan you cannot use is just prepaid waste.
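The same recommendation can be pulled from the CLI if you would rather script the check than click through the console. A minimal sketch, assuming a one-year, no-upfront Compute Savings Plan as described above; sp_recommendation is a name I chose for illustration, and the lookback window defaults to 30 days.

```shell
# Sketch: hourly commitment that Cost Explorer recommends for a
# one-year, no-upfront Compute Savings Plan.
# Usage: sp_recommendation [SEVEN_DAYS|THIRTY_DAYS|SIXTY_DAYS]
sp_recommendation() {
  aws ce get-savings-plans-purchase-recommendation \
    --savings-plans-type COMPUTE_SP \
    --term-in-years ONE_YEAR \
    --payment-option NO_UPFRONT \
    --lookback-period-in-days "${1:-THIRTY_DAYS}" \
    --query 'SavingsPlansPurchaseRecommendation.SavingsPlansPurchaseRecommendationSummary.HourlyCommitmentToPurchase' \
    --output text
}
```

Treat the number it prints the way you would treat the console figure: a ceiling, not a target. Comparing the 30-day and 60-day lookbacks is a cheap sanity check, since a big gap between them usually means the baseline is still moving.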
What are the silent money pits?
These do not show up as a big obvious instance, which is exactly why they survive. The worst offender is the NAT Gateway. It charges an hourly rate plus a per-gigabyte data processing fee on everything that passes through it, so a chatty service pulling packages or talking to S3 over the NAT can quietly run into hundreds of dollars a month. Route S3 and DynamoDB traffic through a free Gateway VPC Endpoint instead of the NAT, and if you have one NAT per Availability Zone for redundancy you do not strictly need, consolidate.
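Creating the gateway endpoint is a single call. The sketch below assumes you already know the VPC ID and the route table used by the private subnets; create_gateway_endpoint is a hypothetical wrapper, the IDs in the usage line are placeholders, and the region defaults to us-east-1 purely as an example.

```shell
# Sketch: create a Gateway VPC Endpoint for S3 or DynamoDB and attach it
# to one route table, so that traffic stops flowing through the NAT.
# Usage: create_gateway_endpoint s3|dynamodb VPC_ID ROUTE_TABLE_ID [REGION]
create_gateway_endpoint() {
  local service="$1" vpc_id="$2" route_table_id="$3" region="${4:-us-east-1}"
  aws ec2 create-vpc-endpoint \
    --vpc-endpoint-type Gateway \
    --vpc-id "$vpc_id" \
    --service-name "com.amazonaws.${region}.${service}" \
    --route-table-ids "$route_table_id" \
    --region "$region"
}
```

A call looks like create_gateway_endpoint s3 vpc-0123456789abcdef0 rtb-0123456789abcdef0. The endpoint adds a route to the table you name, so repeat it (or pass the right table) for every private subnet whose traffic currently exits through the NAT.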
Data transfer is the other silent killer. Traffic between Availability Zones is billed in both directions, and traffic out to the internet is billed per gigabyte. Chatty cross-AZ replication or an app server in AZ-a hammering a database in AZ-b adds up. Then there is the dead weight: unattached EBS volumes from terminated instances, EBS snapshots from a backup script that never prunes, and load balancers pointing at empty target groups. Each one bills whether or not anything uses it.
# Unattached EBS volumes - billing for storage nobody uses.
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,GiB:Size,Created:CreateTime}' \
  --output table
# Your own snapshots, oldest first - prune past the retention you need.
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ec2 describe-snapshots --owner-ids "$ACCOUNT_ID" \
  --query 'sort_by(Snapshots,&StartTime)[].{ID:SnapshotId,Started:StartTime,GiB:VolumeSize}' \
  --output table
# Load balancers stay on the clock even with nothing behind them.
# This lists every load balancer; check each one's target groups before deleting.
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[].{Name:LoadBalancerName,DNS:DNSName}' --output table

Verify before you delete. An "unattached" volume might be a deliberate cold spare, and a snapshot might be the only copy of something. Confirm a snapshot exists before deleting a volume. If you do not already have a disciplined backup lifecycle, set one up so snapshots prune themselves instead of accumulating; I walk through that in automating EC2 snapshots and backups.
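Listing load balancers does not tell you whether anything sits behind them. One way to narrow it down is to look for target groups with no registered targets at all. This is a sketch under that assumption; empty_target_groups is a name I made up, and a group with registered but unhealthy targets will not show up here.

```shell
# Sketch: print the ARN of every target group that has zero registered targets.
empty_target_groups() {
  aws elbv2 describe-target-groups \
    --query 'TargetGroups[].TargetGroupArn' --output text \
  | tr '\t' '\n' \
  | while read -r tg; do
      [ -n "$tg" ] || continue            # skip blank output when there are no groups
      count=$(aws elbv2 describe-target-health \
        --target-group-arn "$tg" \
        --query 'length(TargetHealthDescriptions)' --output text < /dev/null)
      if [ "$count" = "0" ]; then
        echo "$tg"
      fi
    done
}
```

Any ARN it prints is a candidate, not a verdict: confirm which load balancer the group belongs to and that nothing is about to be deployed into it before you remove either.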
How do I get old S3 data to stop costing full price?
S3 Standard charges the same per gigabyte for a log file from 2022 nobody will ever read as it does for hot data. Two features fix this. S3 Intelligent-Tiering watches access patterns and moves objects to cheaper tiers automatically for a small monitoring fee, which is the safe default when you do not know your access pattern. A lifecycle rule is the deterministic option: transition objects to a cheaper storage class after N days, then expire them when they are no longer needed at all.
{
  "Rules": [
    {
      "ID": "archive-then-expire-logs",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}

# Save the JSON above as lifecycle.json, then apply it.
# Note: this call replaces the bucket's existing lifecycle configuration,
# so include any rules you want to keep.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-app-logs \
  --lifecycle-configuration file://lifecycle.json

Mind the retrieval cost gotcha: Glacier Flexible Retrieval and Glacier Deep Archive are dirt cheap to store, but you pay to retrieve, and there is a minimum storage duration (90 days for Glacier Flexible Retrieval, 180 for Deep Archive) where early deletion still bills the full period. Only push data to Glacier that you genuinely will not touch. For unknown access patterns, Intelligent-Tiering is the lower-risk call. AWS documents the full transition matrix in the official S3 storage class guidance at https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html.
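If you go the Intelligent-Tiering route, one way to apply it to objects already in a bucket is a lifecycle rule that transitions them on day 0. This is a sketch under stated assumptions: enable_intelligent_tiering is a hypothetical helper, the prefix must not contain characters that need JSON escaping, and the call overwrites whatever lifecycle configuration the bucket already has, so merge rules by hand if there are existing ones.

```shell
# Sketch: move objects under a prefix to S3 Intelligent-Tiering via a lifecycle rule.
# Warning: this replaces the bucket's whole lifecycle configuration.
# Usage: enable_intelligent_tiering BUCKET [PREFIX]
enable_intelligent_tiering() {
  local bucket="$1" prefix="${2:-}"
  local config
  config=$(mktemp) || return 1
  cat > "$config" <<EOF
{
  "Rules": [
    {
      "ID": "move-to-intelligent-tiering",
      "Filter": { "Prefix": "${prefix}" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 0, "StorageClass": "INTELLIGENT_TIERING" }
      ]
    }
  ]
}
EOF
  aws s3api put-bucket-lifecycle-configuration \
    --bucket "$bucket" \
    --lifecycle-configuration "file://${config}"
  local status=$?
  rm -f "$config"                 # clean up the temporary rule file
  return "$status"
}
```

Run it as enable_intelligent_tiering my-app-logs logs/ to scope the rule to one prefix, or omit the prefix to cover the whole bucket. Because there is no retrieval fee surprise to plan around, this is the version I would reach for when nobody can say how often the data is read.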
Nobody ever got paged because an instance was too small in staging. They get paged for the invoice. Right-size in a window you control, not the one accounting picks for you.
The prioritized checklist
Work top to bottom. The early items are zero-risk and pay for the time you spend on the rest.
- Turn on Cost Explorer and create a Budget with an 80% email alert. Costs nothing, takes ten minutes.
- Tag every resource with a Project tag and activate it as a cost allocation tag. You cannot optimize what you cannot attribute.
- Convert all gp2 EBS volumes to gp3. Cheaper and faster, live, no downtime. Lowest-risk win on the list.
- Delete unattached EBS volumes, prune old snapshots, and remove load balancers with empty target groups - after confirming each is truly dead.
- Replace NAT-routed S3/DynamoDB traffic with free Gateway VPC Endpoints, and review cross-AZ chatter.
- Right-size over-provisioned EC2 and RDS using Compute Optimizer and CloudWatch p95 - test in staging, change one at a time.
- Apply S3 lifecycle rules or Intelligent-Tiering to old objects.
- Buy a one-year no-upfront Compute Savings Plan sized to your steady baseline - last, after the waste is already gone.
Notice the commitment is the last step, not the first. Buying a Savings Plan before right-sizing just locks in your current waste for a year. Clean up the idle and oversized resources, then commit to what is left. If you are still deciding how to host the workload in the first place, EC2 vs Lightsail: which to use covers when a flat Lightsail price beats the EC2 a-la-carte model for a small project, and once you are on EC2, my AWS production architecture walkthrough shows the layout I actually deploy. Cost optimization is not a one-off project; it is a quarterly habit. Set the budget alert, tag everything, and do a 30-minute idle-resource sweep every few weeks. The discipline is what keeps the 30% you just clawed back from creeping straight back in.

