AWS Cost Optimization: Cutting Spend Without Cutting Reliability (2020)

Squalltec Team · November 22, 2020

TL;DR

AWS cost optimization that actually lasts is not a one-time “rightsizing sprint.” It’s a product: an operating model, a measurement system, and a set of guardrails. The fastest path to meaningful, safe savings:

  • Establish visibility: tagging, cost allocation, and ownership per service/team
  • Define unit economics: cost per booking, per transaction, per active property, per customer
  • Attack the big levers: compute, storage, data transfer, and managed databases
  • Add guardrails: budgets, anomaly detection, policy-as-code, and change reviews
  • Institutionalize: weekly cost review, monthly capacity planning, quarterly architecture refresh

This guide is structured so you can implement it in phases without “breaking production to save money.”

Why Cloud Bills “Suddenly” Explode

Cloud costs rarely spike because engineers are careless. They spike because cloud pricing rewards precision and punishes drift:

  • A service ships quickly, then grows usage 10x
  • Environments multiply (dev/stage/preview/feature branches)
  • Data grows (logs, metrics, snapshots, object storage)
  • Network costs appear later (egress, cross-AZ, NAT)
  • Nobody “owns” the bill, so nobody optimizes it

If your cost strategy is “we’ll optimize later,” later arrives as a surprise invoice.

The Mindset Shift: Cost Is a Reliability Constraint

Treat cost like latency and availability:

  • You don’t accept random latency regressions
  • You don’t accept unknown availability risk
  • You shouldn’t accept unknown cost drift

If you want to keep reliability while reducing spend, you need measurable targets and safe processes.

Phase 1 (Week 1): Make Cost Visible and Actionable

1) Tagging That Engineers Actually Follow

Tagging is not an accounting exercise. It’s how you make optimization possible. Start with a minimal, enforceable standard:

  • env: prod | staging | dev
  • service: canonical service/app name
  • owner: team or squad
  • cost_center: business unit (optional but helpful)
  • data_class: public | internal | restricted (helps with compliance + storage decisions)

The goal is to answer, in minutes:

  • Which services are driving spend?
  • Which team owns the cost?
  • Which environment is leaking?
  • What changed last week?
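
To automate the "which resources are missing tags" part of those questions, here is a minimal sketch using boto3 and the Resource Groups Tagging API. The required-tag list mirrors the standard above; the region default and script structure are assumptions to adapt.

```python
import boto3

# Assumed minimal tag standard from this section; adjust to your own.
REQUIRED_TAGS = {"env", "service", "owner"}

def find_untagged_resources(region="eu-west-1"):
    """Yield ARNs that are missing any of the required tags."""
    client = boto3.client("resourcegroupstaggingapi", region_name=region)
    paginator = client.get_paginator("get_resources")
    for page in paginator.paginate():
        for mapping in page["ResourceTagMappingList"]:
            tags = {t["Key"] for t in mapping.get("Tags", [])}
            missing = REQUIRED_TAGS - tags
            if missing:
                yield mapping["ResourceARN"], sorted(missing)

if __name__ == "__main__":
    for arn, missing in find_untagged_resources():
        print(f"{arn} is missing tags: {', '.join(missing)}")
```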

2) Create a Cost Ownership Map

Make a table (even a spreadsheet) that maps:

  • Service → owner → on-call channel → cost target
  • Shared platform costs → allocation method (percentage, usage, or flat)

Do not over-engineer allocation on day one. Do make ownership explicit.

3) Introduce Weekly Cost Review (30 Minutes)

Agenda template:

  • Top 5 spenders by service
  • Top 5 week-over-week changes
  • Any anomalies (unexpected spikes)
  • Actions: assign 1–3 concrete optimizations with owners

If you can’t run this meeting, you’re not ready to optimize. You’re guessing.
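
To make the meeting cheap to prepare, the "top spenders" number can be pulled programmatically with the Cost Explorer API. This is a sketch that assumes the service tag above has been activated as a cost allocation tag; treat it as a starting point for the weekly review, not a finished reporting tool.

```python
import boto3
from datetime import date, timedelta

def top_spenders_by_service(days=7, top_n=5):
    """Return the top-N services by unblended cost over the last `days` days."""
    ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is served from us-east-1
    end = date.today()
    start = end - timedelta(days=days)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "service"}],  # requires an activated cost allocation tag
    )
    totals = {}
    for day in resp["ResultsByTime"]:
        for group in day["Groups"]:
            key = group["Keys"][0]  # returned as "service$<value>", e.g. "service$booking-api"
            totals[key] = totals.get(key, 0.0) + float(
                group["Metrics"]["UnblendedCost"]["Amount"]
            )
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

if __name__ == "__main__":
    for service, cost in top_spenders_by_service():
        print(f"{service}: ${cost:,.2f}")
```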

Phase 2 (Weeks 2–3): Define Unit Economics (So Savings Don’t Rebound)

“We reduced AWS by 15%” is not a strategy. It’s a moment in time.

You want:

  • Cost per booking
  • Cost per transaction
  • Cost per active customer
  • Cost per GB processed
  • Cost per property per month (hospitality)

Tracking unit economics keeps cost growth proportional to business growth as you scale. Without it, costs rebound the moment growth returns.

Practical Example: Cost Per Booking

If you run a booking platform, you can estimate:

  • Total monthly infra cost for booking services
  • Total monthly completed bookings
  • Cost per booking = infra / bookings

Then you can correlate changes:

  • A new feature increased CPU usage
  • A new analytics pipeline increased data transfer
  • A partner integration increased retry volume
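
The arithmetic itself is trivial; the value is in tracking it over time. A tiny sketch, with made-up numbers purely for illustration:

```python
def cost_per_booking(monthly_infra_cost_usd: float, completed_bookings: int) -> float:
    """Unit economics for a booking platform: infra cost divided by completed bookings."""
    if completed_bookings == 0:
        raise ValueError("No completed bookings in the period")
    return monthly_infra_cost_usd / completed_bookings

# Illustrative numbers only (not real data): $42,000/month and 350,000 bookings
print(f"${cost_per_booking(42_000, 350_000):.3f} per booking")  # -> $0.120 per booking
```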

Phase 3: The High-Impact Cost Levers (Without Breaking Production)

Most savings come from four categories:

  1. Compute
  2. Storage
  3. Data transfer
  4. Managed databases and caching

Everything else is usually second-order.

Compute Optimization: EC2, Containers, Serverless

The Three Compute Questions

For each workload ask:

  1. Do we need always-on capacity?
  2. Do we need predictable performance?
  3. Do we need bursty scaling?

Your answers determine whether EC2, containers, or serverless is the better fit.

Rightsizing Without Guessing

Rightsizing should be driven by observed utilization and SLOs:

  • CPU utilization (average and p95)
  • Memory utilization (average and p95)
  • Request latency (p95/p99)
  • Error rates

If you rightsize purely on CPU average, you can break latency during bursts. If you rightsize purely on p99, you might overpay. The goal is a balanced envelope tied to your SLOs.
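
One way to get that envelope is to pull both the average and the p95 from CloudWatch before touching instance sizes. A sketch, where the instance ID, region, and lookback window are placeholders (memory utilization needs the CloudWatch agent and is omitted here):

```python
import boto3
from datetime import datetime, timedelta, timezone

def cpu_envelope(instance_id: str, days: int = 14, region: str = "eu-west-1"):
    """Return (average, worst p95) hourly CPU utilization for an EC2 instance over `days` days."""
    cw = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    common = dict(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=3600,  # hourly datapoints
    )
    # GetMetricStatistics accepts either Statistics or ExtendedStatistics, not both.
    avg_points = cw.get_metric_statistics(Statistics=["Average"], **common)["Datapoints"]
    p95_points = cw.get_metric_statistics(ExtendedStatistics=["p95"], **common)["Datapoints"]
    avg = sum(p["Average"] for p in avg_points) / max(len(avg_points), 1)
    p95 = max((p["ExtendedStatistics"]["p95"] for p in p95_points), default=0.0)
    return avg, p95
```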

Common “Silent Waste” Patterns

Always-On Dev/Staging

If dev/stage mirrors prod “just in case,” you pay twice. Options:

  • Auto-schedule non-prod to stop outside working hours
  • Use smaller footprints for non-prod
  • Replace staging replicas with synthetic load testing windows
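
A minimal sketch of the first option: a scheduled job (for example a Lambda on a cron rule) that stops running instances tagged env=dev or env=staging. The tag values and region are assumptions; with dry_run=True it only reports what it would stop.

```python
import boto3

def stop_non_prod_instances(region="eu-west-1", dry_run=True):
    """Stop running EC2 instances tagged env=dev or env=staging (run on an evening schedule)."""
    ec2 = boto3.client("ec2", region_name=region)
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:env", "Values": ["dev", "staging"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for inst in reservation["Instances"]
    ]
    # With dry_run=True this only returns the candidates so you can review them first.
    if instance_ids and not dry_run:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```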

Over-Provisioned Baselines

Teams often set generous baselines to avoid being paged. The fix is not “cut capacity.” The fix is:

  • Define SLO-based autoscaling policies
  • Add load tests and canaries
  • Create rollback plans

“Small” Instances That Multiply

The most expensive architecture is “lots of small things with no lifecycle.” Preview environments, temporary workers, and forgotten test clusters add up.

Solution:

  • Time-to-live on ephemeral infrastructure
  • Owner tags required
  • Automated cleanup jobs
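
A sketch of a cleanup report, assuming a hypothetical expires tag (an ISO date) on ephemeral resources. It only lists candidates; actual deletion should stay environment-specific and reviewed.

```python
import boto3
from datetime import date

def expired_ephemeral_resources(tag_key="expires"):
    """List ARNs whose (hypothetical) `expires` tag holds an ISO date in the past."""
    client = boto3.client("resourcegroupstaggingapi")
    paginator = client.get_paginator("get_resources")
    today = date.today().isoformat()
    expired = []
    for page in paginator.paginate(TagFilters=[{"Key": tag_key}]):
        for mapping in page["ResourceTagMappingList"]:
            tags = {t["Key"]: t["Value"] for t in mapping.get("Tags", [])}
            if tags.get(tag_key, "9999-12-31") < today:  # ISO dates compare lexicographically
                expired.append(mapping["ResourceARN"])
    return expired
```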

Capacity Planning: The Boring Superpower

Reliability-preserving savings usually come from planning:

  • What’s your expected growth?
  • What are seasonal spikes?
  • Where is headroom required?

Even simple forecasting prevents reactive, expensive scaling.

Commitment Discounts: Savings Plans vs Reserved Instances (Conceptual Guide)

Commitment discounts can deliver major savings, but only after you have visibility.

Use a decision approach:

  • If your workload is steady: consider commitments
  • If your workload is highly variable: focus on autoscaling and architecture first
  • If you’re migrating aggressively: avoid locking into assumptions too early

Rules of thumb:

  • Start small and increase commitments as confidence grows
  • Commit on the portion of usage you are sure you’ll keep
  • Revisit commitments regularly as architecture changes
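
A conceptual sketch of “commit on the portion you are sure you’ll keep”: take observed hourly compute spend and commit near a low percentile, leaving spikes on demand. The percentile and sample numbers are assumptions, and AWS’s own Savings Plans recommendations should still be your primary input.

```python
def safe_commitment_baseline(hourly_compute_spend, percentile=0.10):
    """Suggest an hourly commitment from observed hourly compute spend.

    Committing near a low percentile covers usage you are very likely to keep,
    while spiky usage above it stays on-demand or is handled by autoscaling.
    """
    if not hourly_compute_spend:
        return 0.0
    ordered = sorted(hourly_compute_spend)
    index = int(len(ordered) * percentile)
    return ordered[min(index, len(ordered) - 1)]

# Illustrative only: a real run would use a month of hourly spend samples.
samples = [3.1, 2.9, 3.4, 5.8, 3.0, 2.8, 4.9, 3.2]
print(f"Commit roughly ${safe_commitment_baseline(samples):.2f}/hour to start")
```

Starting low and revisiting quarterly, per the rules of thumb above, keeps commitment risk bounded while confidence builds.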

Containers: EKS/Fargate Efficiency Checklist

Container platforms introduce a new source of waste: requested resources that do not match actual usage.

Checklist:

  • Tune CPU/memory requests based on real usage
  • Use horizontal autoscaling with meaningful metrics
  • Avoid “one big node group forever”
  • Rightsize node types and scale-out strategies
  • Reduce over-replication in non-prod

Operational advice:

  • Avoid optimizing in the dark. Add dashboards first.
  • Optimize one service at a time. Measure before and after.

Serverless: Hidden Costs and Smart Wins

Serverless can be cost-effective, but it can also surprise you:

  • High invocation volume
  • Inefficient code causing longer duration
  • Excessive cold starts that push you toward over-provisioned concurrency

Cost wins:

  • Reduce duration (optimize code paths)
  • Reduce data scanned and transferred
  • Use caching and batching where appropriate
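
To see why duration usually dominates, here is a rough cost model with illustrative public on-demand rates (check current regional pricing; the free tier is ignored):

```python
def monthly_lambda_cost(invocations, avg_duration_ms, memory_mb,
                        price_per_million_requests=0.20,
                        price_per_gb_second=0.0000166667):
    """Rough monthly Lambda cost: request charge plus duration charge (GB-seconds).

    Prices are illustrative on-demand rates and ignore the free tier.
    """
    request_cost = invocations / 1_000_000 * price_per_million_requests
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return request_cost + gb_seconds * price_per_gb_second

# Example: 50M invocations/month, 120 ms average duration, 512 MB memory
print(f"${monthly_lambda_cost(50_000_000, 120, 512):,.2f} per month")  # ~ $60
```

With these assumed numbers, the duration charge is roughly five times the request charge, which is why shaving milliseconds and memory is usually the first win.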

Storage Optimization: S3, EBS, Snapshots, and Logs

Storage costs are the long-term compounding category. Compute can be optimized quickly; storage becomes a swamp if ignored.

S3: Lifecycle Policies as Default

If you have objects older than 30–90 days, you need a lifecycle policy. Typical tiers:

  • Hot: recently accessed objects
  • Warm: occasional access
  • Cold: rarely accessed
  • Archive: compliance retention

The exact storage class depends on your access patterns and retrieval requirements, but the principle is simple: older data should not live in your most expensive tier by default.
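
A sketch of a baseline lifecycle policy applied with boto3; the tiers (Standard-IA at 30 days, Glacier at 90, expiry at 365) are assumptions to adjust to your access patterns and retention requirements.

```python
import boto3

def apply_default_lifecycle(bucket: str):
    """Apply a baseline lifecycle policy: IA after 30 days, Glacier after 90, expire after 365."""
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "age-out-old-objects",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # whole bucket; narrow the prefix if needed
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "GLACIER"},
                    ],
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )
```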

The “Logs and Metrics Explosion”

Teams often discover that:

  • Logs are retained too long
  • High-cardinality metrics explode cost
  • Debug logs were left enabled in prod

Guardrails:

  • Keep default retention short, extend by exception
  • Sample high-volume logs
  • Reduce cardinality in metrics labels
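
A sketch of the first guardrail: set a short default retention on any CloudWatch Logs group that currently keeps logs forever. The 14-day default and region are assumptions; extend retention by exception where compliance requires it.

```python
import boto3

def enforce_log_retention(default_days=14, region="eu-west-1"):
    """Set a short default retention on log groups that currently keep logs forever."""
    logs = boto3.client("logs", region_name=region)
    paginator = logs.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            if "retentionInDays" not in group:  # no retention set means "never expire"
                logs.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=default_days,
                )
```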

Snapshots and Backups

Backups are non-negotiable. Unmanaged backup sprawl is optional.

Rules:

  • Define retention per environment
  • Regularly prune old snapshots
  • Keep audit logs of retention policy changes
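
A sketch for the pruning step: report (not delete) EBS snapshots older than a cutoff, so the actual cleanup can follow your per-environment retention rules and normal change process.

```python
import boto3
from datetime import datetime, timedelta, timezone

def old_snapshots(days=90, region="eu-west-1"):
    """Report EBS snapshots owned by this account that are older than `days` days."""
    ec2 = boto3.client("ec2", region_name=region)
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    paginator = ec2.get_paginator("describe_snapshots")
    stale = []
    for page in paginator.paginate(OwnerIds=["self"]):
        for snap in page["Snapshots"]:
            if snap["StartTime"] < cutoff:
                stale.append((snap["SnapshotId"], snap["StartTime"].date().isoformat()))
    return stale
```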

Data Transfer: The Cost People Notice Last

Data transfer can become the largest surprise category because it grows with usage and architecture complexity.

Typical Transfer Drivers

  • Internet egress (traffic leaving AWS)
  • Cross-AZ traffic (chatty microservices, database calls)
  • NAT gateway usage (egress from private subnets)
  • Inter-region replication and calls

Practical Ways to Reduce Transfer Cost (Safely)

  • Cache static and semi-static content at the edge (CDN)
  • Reduce chatty service calls (batching, caching, pagination)
  • Keep latency-sensitive, chatty dependencies in the same Availability Zone where your availability design allows it
  • Use private connectivity (such as VPC gateway endpoints) where it avoids expensive NAT or internet egress

The key is to optimize transfer by improving architecture, not by “turning off features.”

Databases: The Quiet Budget Eater

Managed databases are usually worth it for reliability and operations. They can still be optimized.

Common Database Cost Issues

  • Over-provisioned instances “for safety”
  • Lack of query optimization causing CPU growth
  • Storage growth without archiving
  • Read replicas added and never removed

Safe optimization sequence:

  1. Improve queries and indexes
  2. Add caching for repeated reads
  3. Rightsize instances after performance improves
  4. Review replicas and retention policies

Caching: Pay for Less Database

Caching is not just performance. It’s cost control:

  • Fewer database reads
  • Fewer replicas
  • Smaller database instances

But only if you define cache invalidation and consistency expectations clearly.
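
A minimal cache-aside sketch to make those expectations concrete: reads go through a TTL cache, and writes invalidate explicitly. In production the store would typically be ElastiCache (Redis or Memcached); the in-memory dict here is only for illustration.

```python
import time

class CacheAside:
    """Minimal cache-aside sketch: read through a TTL cache, invalidate on writes."""

    def __init__(self, loader, ttl_seconds=60):
        self._loader = loader          # function that reads from the database
        self._ttl = ttl_seconds
        self._store = {}               # key -> (value, expires_at)

    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0))
        if time.time() < expires_at:
            return value               # cache hit: no database read, no replica load
        value = self._loader(key)      # cache miss: exactly one database read
        self._store[key] = (value, time.time() + self._ttl)
        return value

    def invalidate(self, key):
        """Call on writes so readers never see stale data longer than intended."""
        self._store.pop(key, None)
```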

Guardrails: How You Keep Savings From Rebounding

Budgets and Alerts With Ownership

Alerts without owners are noise. Every alert needs:

  • A threshold
  • A channel (team) to receive it
  • A runbook: what to check first
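
A sketch of a budget with an owned alert, using boto3 and AWS Budgets. The service name, limit, and email are placeholders; many teams route the notification to an SNS topic wired to the owning team’s channel instead.

```python
import boto3

def create_service_budget(service: str, monthly_limit_usd: str, alert_email: str):
    """Create a monthly cost budget with an 80% alert routed to the owning team."""
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    budgets = boto3.client("budgets", region_name="us-east-1")  # Budgets is a global service
    budgets.create_budget(
        AccountId=account_id,
        Budget={
            "BudgetName": f"{service}-monthly",
            "BudgetLimit": {"Amount": monthly_limit_usd, "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 80.0,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "EMAIL", "Address": alert_email}],
            }
        ],
    )
```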

Anomaly Detection and “Cost Incidents”

Treat major cost spikes like incidents:

  • Identify the change
  • Roll back if needed
  • Add a guardrail so it can’t happen again

Policy-as-Code (Where Practical)

Examples of enforceable rules:

  • Non-prod resources must have schedules or TTL
  • Storage must have lifecycle policies
  • New services must publish cost dashboards

The principle: make the safe path the default path.
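
A sketch of what a CI policy check can look like, assuming Terraform and a plan exported with terraform show -json plan.out. It only inspects a top-level tags attribute, which not every resource type exposes, so treat it as a starting point; dedicated policy engines handle the many resource-specific cases.

```python
import json
import sys

REQUIRED_TAGS = {"env", "service", "owner"}

def check_plan(plan: dict) -> list:
    """Flag newly created resources in a Terraform plan JSON missing required tags."""
    violations = []
    for change in plan.get("resource_changes", []):
        details = change.get("change") or {}
        after = details.get("after") or {}
        tags = after.get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing and "create" in details.get("actions", []):
            violations.append(f'{change["address"]} missing tags: {", ".join(sorted(missing))}')
    return violations

if __name__ == "__main__":
    problems = check_plan(json.load(sys.stdin))
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail the CI step so the unsafe change cannot merge silently
```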

A Repeatable FinOps Operating Model (Lightweight)

Roles

  • Engineering owner: accountable for optimization actions
  • FinOps partner: supports reporting and governance
  • Product owner: balances cost vs experience and roadmap priorities

Rituals

  • Weekly cost review (tactical)
  • Monthly capacity and architecture review (strategic)
  • Quarterly “cost + reliability” retro (systems thinking)

Metrics That Matter

  • Unit economics (cost per booking/transaction)
  • Week-over-week cost deltas by service
  • Reliability metrics (latency/error) alongside cost
  • “Waste” metrics (idle resources, unused storage, orphaned snapshots)

Quick Wins Checklist (Copy/Paste)

In 7 Days

  • Enforce minimal tags (env, service, owner)
  • Identify top 5 services by cost
  • Set non-prod schedules where safe
  • Reduce log retention defaults

In 30 Days

  • Add unit economics dashboard
  • Rightsize 1–3 major services
  • Implement S3 lifecycle policies for aging data
  • Add budgets and anomaly detection with clear owners

In 90 Days

  • Standardize cost reviews as a habit
  • Refactor chatty service calls that drive transfer costs
  • Optimize database queries and caching strategy
  • Introduce policy checks in CI for new infrastructure

FAQ

Will cost optimization reduce reliability?

Not if you do it correctly. The reliability risks come from blind rightsizing and premature commitments. Visibility, SLOs, gradual changes, and rollbacks keep you safe.

What’s the biggest mistake teams make?

Optimizing spend without assigning ownership. Costs rebound when nobody “owns” the bill as part of their service responsibility.

What’s the biggest savings lever in practice?

Usually compute and storage, followed by data transfer. The exact mix depends on architecture and lifecycle maturity.

Closing Thought

The goal is not “a cheaper cloud bill.” The goal is predictable, proportional cost as you scale. When cost becomes measurable and owned, optimization becomes continuous—and reliability stays intact.

Key Takeaways

Start with visibility

  • Break cost down by environment and service
  • Tag resources by product, owner, and purpose
  • Track cost per customer or per transaction

High-impact cost levers

  • Rightsize compute based on real utilization
  • Use autoscaling with safe minimums
  • Choose the right storage class and lifecycle rules
  • Reduce data transfer by caching and regional design

Guardrails

Cost reduction must preserve:

  • Availability targets
  • Performance targets
  • Security posture