
ElastiCache Cost: What We Found Across 50 Accounts


ElastiCache clusters are among the most consistently over-provisioned resources in AWS accounts. They are provisioned at peak capacity during a feature launch, their configuration is locked in place once the feature ships, and they receive no subsequent review unless they cause a production incident. The result is a fleet of caches running at 15-25% utilization, with read replicas nobody queries, node sizes that reflect traffic levels from 18 months ago, and Multi-AZ configurations applied uniformly regardless of whether the cache is actually on a critical path.

After analyzing ElastiCache configurations across 50 customer accounts, we found that three patterns account for the majority of the optimization opportunity. The total ElastiCache savings identified across those 50 accounts averaged $2,400/month per account, ranging from $180 for smaller deployments to $14,000 for accounts with large Redis clusters serving multiple high-traffic services.

Pattern 1: Read Replicas With Zero Traffic

ElastiCache Redis read replicas are created to distribute read traffic across multiple nodes and provide failover capacity. For caches supporting genuinely high-read traffic, read replicas are justified. The problem is that read replicas persist long after the traffic justification disappears.

The specific scenario we see most frequently: a team adds read replicas during a traffic spike (a product launch, a marketing campaign, a viral event). The replicas handle the load successfully. The traffic spike subsides. The replicas remain. Three months later, the original event is forgotten and the replicas look like part of the normal infrastructure.

Identifying orphaned read replicas is straightforward: ElastiCache provides CacheHits and CacheMisses metrics per node in CloudWatch. A read replica that shows consistently near-zero CacheHits and CacheMisses over a 30-day period is receiving no traffic. It is providing capacity that is never used. The cost depends on node size — an r6g.large read replica costs approximately $124/month On-Demand — but accounts with multiple such replicas across multiple clusters can see $500-$2,000/month in orphaned replica charges.
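As a sketch, the detection logic amounts to a threshold check over per-node datapoints. This assumes the daily CacheHits and CacheMisses sums have already been pulled from CloudWatch; the node names, traffic figures, and the near-zero threshold below are illustrative.

```python
# Flag read replicas whose combined CacheHits + CacheMisses stays near zero
# over the lookback window. Daily sums per node are assumed to have been
# fetched from CloudWatch already; the data below is hypothetical.

def is_orphaned_replica(daily_traffic, threshold=10):
    """True if every daily CacheHits+CacheMisses sum is below threshold."""
    return bool(daily_traffic) and all(t < threshold for t in daily_traffic)

nodes = {
    "sessions-001-002": [48_210, 51_877, 47_903],  # serving real read traffic
    "sessions-001-003": [0, 3, 1],                 # effectively unused
}

orphaned = [name for name, traffic in nodes.items() if is_orphaned_replica(traffic)]
print(orphaned)  # ['sessions-001-003']
```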

The remediation is simple: remove the unused replicas. Before doing so, verify that the application is not explicitly targeting the replica endpoint for read operations. Some clients configure read-from-replica explicitly and need a connection-string update when replicas are removed; checking this first avoids connection failures in production.

Pattern 2: Multi-AZ on Non-Critical Caches

ElastiCache Multi-AZ provides automatic failover by maintaining a replica in a separate availability zone. For a cache on the critical path of a user-facing request — session storage, rate limiting, real-time leaderboard data — Multi-AZ is appropriate. For a cache storing pre-computed analytics data that can be regenerated from the database in 30 seconds if the cache fails, Multi-AZ doubles the node cost without meaningfully improving application availability.

In practice, Multi-AZ is applied uniformly because it is the "safe" default. The engineer provisioning the cache does not want to be responsible for a cache failure causing a production incident, so Multi-AZ goes on without any analysis of whether the cache is actually on a path where a 30-second regeneration delay would be user-visible.

Across the accounts we analyzed, 38% of Multi-AZ ElastiCache clusters were serving data that could be regenerated from the primary database in under 60 seconds with no user-visible impact. Converting those clusters to a single-node configuration (disabling Multi-AZ and removing the standby replica) halves their cost immediately. The savings calculation is the same as for removing a read replica: one fewer node of the same size. For r6g.xlarge clusters ($248/month per node), that is $248/month saved per cluster.

The analysis required to categorize caches correctly is a one-time effort: for each cache, identify what data it stores, identify the source of that data (database, computed result, session), and assess the application behavior if the cache is unavailable for 30-60 seconds during a failover. Cache clusters that can tolerate a brief unavailability without user-visible impact are candidates for single-AZ configuration.
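That categorization pass can be captured in a few lines. The cache inventory, regeneration estimates, and per-node prices below are hypothetical examples; the 60-second tolerance mirrors the threshold described above.

```python
# Classify caches as single-AZ candidates: regenerable from a source of
# truth quickly enough that a failover-length outage is not user-visible.
# Inventory values are hypothetical examples.

caches = [
    {"name": "session-store",    "regen_seconds": None, "user_visible": True,  "node_monthly": 248},
    {"name": "analytics-agg",    "regen_seconds": 30,   "user_visible": False, "node_monthly": 248},
    {"name": "report-snapshots", "regen_seconds": 45,   "user_visible": False, "node_monthly": 124},
]

def single_az_candidates(caches, max_regen=60):
    return [
        c for c in caches
        if c["regen_seconds"] is not None
        and c["regen_seconds"] <= max_regen
        and not c["user_visible"]
    ]

candidates = single_az_candidates(caches)
# Removing the standby replica saves one node's cost per cluster.
monthly_savings = sum(c["node_monthly"] for c in candidates)
print([c["name"] for c in candidates], monthly_savings)  # ['analytics-agg', 'report-snapshots'] 372
```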

Pattern 3: Clusters Serving Deprecated Features

The most dramatic ElastiCache optimization opportunities we find are clusters that are running at near-zero utilization across all metrics — CacheHits, CacheMisses, NetworkBytesIn, NetworkBytesOut — because the feature they support has been deprecated or removed from the application.

Feature deprecation rarely triggers a complete infrastructure cleanup. The application code is removed or disabled, traffic stops flowing to the cache, but the ElastiCache cluster remains because nobody explicitly decommissioned it. The engineering team responsible for the feature has moved on to other projects. The infrastructure is still tagged to the deprecated feature's cost center. Nobody is accountable for cleaning it up.

Identifying these clusters requires looking at a metric combination: near-zero CacheHits and CacheMisses sustained over 30+ days, combined with near-zero network I/O. A cache that receives zero connections for 30 days is functionally unused. If the cluster is running an r6g.2xlarge configuration ($497/month per node, plus any replicas), this is a significant orphaned cost.
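A minimal sketch of that multi-metric check, assuming the 30-day datapoint series per metric have already been collected; the near-zero thresholds are illustrative cutoffs, not AWS-defined values.

```python
# A cluster is functionally unused only if *every* traffic metric is near
# zero across the whole window; thresholds here are illustrative.

NEAR_ZERO = {
    "CacheHits": 10,
    "CacheMisses": 10,
    "NetworkBytesIn": 1_000_000,   # ~1 MB/day is keepalive-level noise
    "NetworkBytesOut": 1_000_000,
}

def is_unused_cluster(metric_series, thresholds=NEAR_ZERO):
    return all(
        max(metric_series.get(name, [0]), default=0) < limit
        for name, limit in thresholds.items()
    )

deprecated_feature_cache = {
    "CacheHits": [0, 0, 2], "CacheMisses": [0, 1, 0],
    "NetworkBytesIn": [4_096, 4_096, 4_096], "NetworkBytesOut": [2_048, 2_048, 2_048],
}
print(is_unused_cluster(deprecated_feature_cache))  # True
```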

Termination requires confirming that no application component still references the cluster's endpoint. The safest approach is to cut off access for 48 hours first: security groups cannot express deny rules, so swap the cluster's security group for one with no inbound rules (security groups deny by default), then monitor for any CloudWatch alarms or application errors that fire during that period. If none fire, the cluster is safe to terminate. This two-step approach takes an extra 48 hours but eliminates the risk of terminating a cache that a background process or monitoring system still depends on.

Node Right-Sizing: The Overlooked Dimension

Beyond the three structural patterns above, ElastiCache node sizes are frequently over-provisioned. For many clusters, the memory utilization metric (DatabaseMemoryUsagePercentage in CloudWatch) sits below 40% for sustained periods, indicating that the node size was chosen at provisioning time based on anticipated growth rather than actual usage.

ElastiCache node right-sizing is lower-risk than EC2 right-sizing because the performance impact is more predictable. For a Redis cluster where memory is the binding cost (the stored data fits in a smaller node type), downsizing from r6g.xlarge to r6g.large halves the cost; the only risk is that the stored dataset grows to exceed the smaller node's capacity, and that risk is manageable with a DatabaseMemoryUsagePercentage alert set at 70%.

The typical recommendation threshold we use: if DatabaseMemoryUsagePercentage stays below 35% over a 90-day window, the cluster is a candidate for one step down in node size. At 35% usage on an r6g.xlarge, the dataset fits comfortably in an r6g.large with headroom for reasonable growth. If the cluster size was chosen for write throughput rather than memory capacity, the throughput metrics need to be reviewed separately — but for most Redis deployments used for caching rather than pub/sub, memory is the binding constraint.
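The threshold check can be sketched as follows. The node-size ladder is restricted to the r6g family for illustration, and the function deliberately recommends only one step down, matching the guidance above; throughput-bound clusters would need a separate check.

```python
# One-step downsize recommendation based on 90 days of
# DatabaseMemoryUsagePercentage samples. The ladder is illustrative.

R6G_LADDER = ["cache.r6g.large", "cache.r6g.xlarge", "cache.r6g.2xlarge"]

def downsize_candidate(node_type, mem_pct_90d, threshold=35.0):
    """Return the next size down if memory stayed under threshold, else None."""
    if not mem_pct_90d or max(mem_pct_90d) >= threshold:
        return None
    idx = R6G_LADDER.index(node_type)
    return R6G_LADDER[idx - 1] if idx > 0 else None

print(downsize_candidate("cache.r6g.xlarge", [22.5, 31.0, 28.4]))  # cache.r6g.large
```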

Scheduling ElastiCache for Non-Production Environments

Non-production ElastiCache clusters — cache clusters attached to development, staging, and QA environments — can be scheduled the same way EC2 instances can. ElastiCache has no native scheduling support, and clusters cannot be stopped the way RDS instances can; scheduling means deleting the cluster outside working hours and recreating it when the environment is needed again, typically driven by IaC or a scheduled Lambda.

For Redis clusters, taking a final snapshot at deletion and restoring from it at recreation preserves the cached data across the cycle. For most non-production caching workloads, skipping the snapshot and starting cold is also acceptable, since cache warm-up is fast.
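Whatever mechanism tears the cache down and brings it back, the decision of whether a non-production cache should be up at a given moment reduces to a schedule check. A minimal sketch, with a hypothetical 08:00-19:00 Monday-Friday window:

```python
from datetime import datetime

# Decide whether a non-production cache should be running right now.
# The business-hours window is a hypothetical example; the actual
# teardown/recreate would be driven by IaC or a scheduled Lambda.

def should_run(now, start_hour=8, end_hour=19):
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

print(should_run(datetime(2024, 3, 12, 14, 30)))  # Tuesday afternoon -> True
print(should_run(datetime(2024, 3, 16, 14, 30)))  # Saturday -> False
```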

Non-production ElastiCache scheduling combined with the structural optimizations above — removing orphaned replicas, disabling unnecessary Multi-AZ, right-sizing over-provisioned nodes — routinely reduces ElastiCache spend by 40-60% in accounts that have not undergone a recent review. As we describe in our guide to non-production scheduling, the principle is the same: identify what runs when nobody is using it and give it a schedule that reflects actual usage patterns.

Audit Frequency: ElastiCache Needs Quarterly Review

The three patterns described in this article recur because ElastiCache clusters are provisioned and forgotten. The solution is not just to run a one-time cleanup — it is to establish a quarterly review cadence for ElastiCache that checks for new occurrences of these patterns as the account evolves.

A quarterly ElastiCache audit takes approximately two hours for an account with 20-30 clusters if the CloudWatch data is already collected and aggregated. The audit checks: any clusters with zero traffic in the past 30 days, any Multi-AZ clusters where the data can be regenerated in under 60 seconds, any clusters where DatabaseMemoryUsagePercentage is below 35% sustained, and any read replicas where per-node cache hit rates are near zero. Four checks, two hours, recurring savings from each audit cycle.
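Expressed as code, the four checks reduce to a per-cluster flag function. The cluster record fields and thresholds below are hypothetical but mirror the criteria in the text:

```python
def audit_flags(cluster):
    """Return which of the four quarterly checks a cluster trips."""
    flags = []
    if cluster["traffic_30d"] == 0:                       # zero traffic in 30 days
        flags.append("zero-traffic")
    regen = cluster.get("regen_seconds")
    if cluster["multi_az"] and regen is not None and regen < 60:
        flags.append("unnecessary-multi-az")
    if cluster["mem_pct_max_90d"] < 35:                   # sustained low memory use
        flags.append("downsize-candidate")
    if any(hits < 10 for hits in cluster["replica_hits"]):  # near-zero replica hits
        flags.append("orphaned-replica")
    return flags

print(audit_flags({
    "traffic_30d": 0, "multi_az": True, "regen_seconds": 30,
    "mem_pct_max_90d": 22.0, "replica_hits": [0],
}))
```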

Identify your ElastiCache optimization opportunities

KernelRun's resource discovery scans ElastiCache clusters for orphaned replicas, unnecessary Multi-AZ, and over-provisioned node sizes within 15 minutes of connecting. Connect your first account in 4 minutes.

Request a Demo