
What We Learned From Our First 50 Customer Accounts


After we analyzed 50 engineering teams' cloud accounts during KernelRun's initial access period, some findings were expected and some were not. The expected one: most accounts had identifiable waste exceeding 25% of total compute spend. The unexpected one: the distribution of waste types was far more consistent across accounts than we anticipated. Three categories (non-production scheduling, storage snapshot accumulation, and managed database over-provisioning) account for 61% of the total savings identified across the dataset. The long tail of remaining savings is spread across 11 other categories, each representing less than 10% of the total individually.

That concentration matters for prioritization. A FinOps program that focuses on these three categories first will capture the majority of available savings before tackling the more complex work of compute right-sizing, commitment purchasing, and architecture-level changes. Here is what the data showed.

Category 1: Non-Production Scheduling (28% of Total Savings)

Non-production scheduling was the single largest savings category, representing 28% of identified savings across the 50 accounts. This was higher than we expected going in. The median savings from non-production scheduling was $3,400/month per account, ranging from $400 for small teams with few environments to $22,000 for accounts running large staging and QA fleets across multiple services.

The pattern is consistent: development and staging environments running 24 hours a day, 7 days a week, with near-zero utilization during nights and weekends. They are not running continuously because the team genuinely needs them always on — they are running continuously because nobody has implemented a schedule. When asked why, the most common answer is "we tried once and it broke something, and we never had time to fix the approach."
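As a minimal sketch of what "implementing a schedule" means in practice, the check below decides whether a non-production environment should be up at a given moment. The working window (weekdays, 07:00 to 20:00) and the names `should_be_running`, `scheduled_fraction`, and `override_active` are illustrative assumptions, not values from the dataset or a KernelRun API.

```python
from datetime import datetime, time

# Assumed working window (not from the dataset): weekdays, 07:00 to 20:00 local time.
WORK_START = time(7, 0)
WORK_END = time(20, 0)


def should_be_running(now: datetime, override_active: bool = False) -> bool:
    """Return True if a non-production environment should be up at `now`."""
    if override_active:  # a team has asked to keep the environment on
        return True
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and WORK_START <= now.time() < WORK_END


def scheduled_fraction() -> float:
    """Fraction of the 168-hour week the environment runs under this window."""
    hours_per_day = WORK_END.hour - WORK_START.hour
    return hours_per_day * 5 / (24 * 7)
```

Under this assumed window the environment runs 65 of the 168 hours in a week, roughly 39% of the always-on baseline. The actual saving depends on how each provider bills stopped resources.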

The "it broke something" event is almost always a dependency ordering issue at startup. This kind of problem is straightforward to resolve but requires an upfront analysis of the environment dependency graph. Teams that implement scheduling correctly the first time, with proper startup sequencing and team-based override mechanisms, sustain the schedules without disruption. Teams that skip the dependency analysis end up with broken environments at 9 AM Monday and give up on scheduling entirely.
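One way to derive a safe startup sequence from a dependency graph is a topological sort. The sketch below uses Python's standard-library `graphlib`. The service names and the `staging` graph are hypothetical, and a real environment would need its dependencies collected first.

```python
from graphlib import TopologicalSorter


def startup_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return services ordered so that every dependency starts before its dependents.

    `depends_on` maps each service to the set of services it needs running first.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical staging environment, for illustration only.
staging = {
    "database": set(),
    "cache": set(),
    "api": {"database", "cache"},
    "frontend": {"api"},
}

order = startup_order(staging)
shutdown_order = list(reversed(order))  # stop dependents before their dependencies
```

Reversing the startup order gives a shutdown order in which nothing is stopped while another service still depends on it.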

Category 2: Storage Snapshot Accumulation (19% of Total Savings)

Snapshot accumulation was the second-largest category and the one teams were least aware of. Across the 50 accounts, the average orphaned snapshot cost was $1,800/month, with several accounts exceeding $8,000/month in snapshots taken from since-terminated instances or generated by deprecated backup policies that were never cleaned up.

The distribution by account age was striking. Accounts running for more than 3 years had average orphaned snapshot costs of $3,200/month. Accounts running for less than 18 months averaged under $400/month. Snapshot accumulation compounds — it grows continuously as instances are terminated and backup policies continue creating new snapshots with no one cleaning up the old ones. The longer an account has been running without a snapshot audit, the larger the orphaned cost.

Identifying orphaned snapshots requires a specific query: enumerate all volume snapshots in the account, cross-reference each snapshot's source instance identifier against the instances that still exist (running or stopped), and flag snapshots whose source no longer exists. This query is straightforward via the provider CLI or SDK, but the result does not surface in a standard cost dashboard view, which is why orphaned snapshots go undetected in most accounts.
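The cross-reference step can be sketched independently of any provider. The example below assumes the snapshot list and the set of existing source identifiers have already been fetched through the provider's CLI or SDK. The `Snapshot` fields and the `find_orphans` name are hypothetical, and `source_id` stands for whatever identifier the provider records for a snapshot's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    source_id: str           # identifier of the resource the snapshot was taken from
    monthly_cost_usd: float


def find_orphans(snapshots: list[Snapshot], existing_source_ids: set[str]) -> list[Snapshot]:
    """Return snapshots whose source resource no longer exists in the account."""
    return [s for s in snapshots if s.source_id not in existing_source_ids]


# Illustrative data only.
snapshots = [
    Snapshot("snap-1", "src-a", 12.0),
    Snapshot("snap-2", "src-b", 40.0),
    Snapshot("snap-3", "src-gone", 95.5),
]
orphans = find_orphans(snapshots, existing_source_ids={"src-a", "src-b"})
orphan_cost = sum(s.monthly_cost_usd for s in orphans)
```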

One important nuance: not all snapshots of terminated instances should be deleted. Some serve as disaster recovery archives and should be retained regardless of whether the source still exists. The cleanup workflow needs a "retention reason" classification step so that legitimate archive snapshots are not deleted alongside genuine orphans.
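A classification step along these lines could be driven by a tag on each snapshot. The sketch below is one possible shape: the `retention-reason` tag key, the set of accepted reasons, and the three outcome labels are all assumptions made for illustration. Only the disaster recovery case comes from the text above.

```python
# Assumed tag key and accepted values; adapt to the account's own tagging policy.
RETENTION_TAG = "retention-reason"
ARCHIVE_REASONS = {"disaster-recovery"}


def classify_orphan(tags: dict[str, str]) -> str:
    """Classify an orphaned snapshot as 'retain', 'review', or 'delete-candidate'."""
    reason = tags.get(RETENTION_TAG, "").strip().lower()
    if reason in ARCHIVE_REASONS:
        return "retain"            # legitimate archive, keep even though the source is gone
    if reason:
        return "review"            # someone stated a reason we do not recognize; ask a human
    return "delete-candidate"      # no stated reason to keep it
```

Routing unrecognized reasons to manual review rather than deletion keeps the workflow conservative: a snapshot is only proposed for deletion when nobody has recorded a reason to keep it.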

Category 3: Managed Database Over-Provisioning (14% of Total Savings)

Managed database over-provisioning was the third-largest category. Unlike compute over-provisioning, which is widely recognized and has tooling support from most providers, database over-provisioning is less frequently analyzed and more likely to persist unreviewed. Across the 50 accounts, the average right-sizing opportunity was $2,100/month, concentrated in three configurations: high-availability mode on non-critical databases, read replicas with zero query traffic, and storage allocation far exceeding actual usage.

The high-availability pattern is applied as a default for production databases regardless of criticality. A nightly batch job database with an 8-hour maintenance window and a recovery time objective measured in hours does not require automatic failover with a standby replica running at full cost in a separate availability zone. Disabling high-availability mode on genuinely non-critical managed databases is the highest-ROI single change in most accounts.
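The rule of thumb above can be expressed as a simple filter over a database inventory. This is a sketch under stated assumptions: the `Database` fields are hypothetical, and the 4-hour recovery time objective threshold is an illustrative cutoff for "measured in hours", not a figure from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    name: str
    high_availability: bool
    rto_hours: float  # recovery time objective agreed with the owning team


# Assumed cutoff: an RTO at or above this is treated as tolerant of a manual restore.
RTO_THRESHOLD_HOURS = 4.0


def ha_review_candidates(databases: list[Database]) -> list[str]:
    """Names of databases running high availability despite a relaxed RTO."""
    return [
        db.name
        for db in databases
        if db.high_availability and db.rto_hours >= RTO_THRESHOLD_HOURS
    ]


# Illustrative inventory only.
inventory = [
    Database("checkout", high_availability=True, rto_hours=0.25),
    Database("nightly-batch", high_availability=True, rto_hours=8.0),
    Database("analytics-scratch", high_availability=False, rto_hours=24.0),
]
candidates = ha_review_candidates(inventory)
```

The output is a review list, not a change list: the owning team still confirms that the stated recovery objective is accurate before high availability is disabled.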

Read replicas with zero query traffic are a close second. They get provisioned during traffic spikes and remain after traffic returns to normal. The connection count metric per replica identifies them: a replica with fewer than 10 connections over 30 days is receiving no meaningful query traffic and should be evaluated for termination. Most accounts with large database fleets have at least two or three of these sitting idle and costing full instance rates.
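The connection-count test described above might look like the following, assuming daily connection totals per replica have already been exported from the provider's monitoring service. The function name and input shape are hypothetical. Requiring a full 30-day window is an added safeguard so that newly created replicas are not flagged.

```python
def idle_replicas(
    daily_connections: dict[str, list[int]],
    threshold: int = 10,
    window_days: int = 30,
) -> list[str]:
    """Return replicas with fewer than `threshold` total connections in the window.

    `daily_connections` maps a replica name to its daily connection totals,
    oldest first. Replicas with less than a full window of data are skipped.
    """
    idle = []
    for replica, counts in daily_connections.items():
        recent = counts[-window_days:]
        if len(recent) >= window_days and sum(recent) < threshold:
            idle.append(replica)
    return idle


# Illustrative data only.
metrics = {
    "replica-a": [0] * 30,          # no traffic at all
    "replica-b": [5] * 30,          # 150 connections in the window
    "replica-c": [0] * 29 + [9],    # 9 connections in the window
    "replica-new": [0] * 7,         # too new to judge
}
flagged = idle_replicas(metrics)
```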

Where the Remaining 39% Comes From

The remaining 39% is distributed across 11 categories. The largest are compute right-sizing (9%), in-memory cache optimization (7%), and data transfer charge reduction (6%). The other 17% is spread across smaller categories, including orphaned network resources, unused load balancers, oversized serverless function memory allocations, underutilized network gateways, unused DNS zones, stale infrastructure stack resources, and commitment plan optimization.
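The percentages reported in this article can be checked with simple arithmetic. The short script below restates only the figures given in the text; the dictionary names are for illustration.

```python
# Share of total identified savings, in percent, as reported in this article.
top_three = {
    "non-production scheduling": 28,
    "storage snapshot accumulation": 19,
    "managed database over-provisioning": 14,
}
largest_of_tail = {
    "compute right-sizing": 9,
    "in-memory cache optimization": 7,
    "data transfer charge reduction": 6,
}

top_three_total = sum(top_three.values())                       # 28 + 19 + 14
remaining = 100 - top_three_total                               # the long tail
smaller_categories = remaining - sum(largest_of_tail.values())  # tail minus its three largest

print(top_three_total, remaining, smaller_categories)
```

The three totals come out to 61, 39, and 17, matching the figures quoted in the text.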

That compute right-sizing represented only 9% of total identified savings was the finding most at odds with common assumptions in the FinOps community, where right-sizing is often treated as the primary cost optimization lever. Our interpretation: compute right-sizing is harder to implement, because it requires multi-dimensional analysis, headroom calibration, and engineering approval, and is therefore less likely to be complete even in accounts that have done prior cost optimization work. The lower percentage may partly reflect that teams prioritized the easier categories first and left right-sizing partially complete.

What the Teams With the Smallest Gaps Did Differently

Across the 50 accounts, 8 had identified savings below 15% of their total compute spend. These accounts shared three practices the higher-waste accounts did not have.

First, they had an explicit cost review cadence: a monthly 1-hour meeting where someone — FinOps or platform engineering — reviewed the top 5 cost anomalies and assigned owners. Not a weekly review. Not a quarterly one. Monthly, with assigned owners, consistent attendance.

Second, they had resource tagging coverage above 80% with explicit team-level attribution. Not project-level. Not environment-level. Team-level, so that a recommendation could be routed to a specific person with context about why their account looks the way it does.
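Team-level tagging coverage is straightforward to measure once resource tags have been exported. The sketch below assumes each resource is represented by its tag dictionary and that the attribution tag key is `team`. Both are illustrative choices, and some organizations may prefer to weight coverage by spend instead of by resource count.

```python
def team_tag_coverage(resources: list[dict[str, str]], tag_key: str = "team") -> float:
    """Fraction of resources that carry a non-empty team attribution tag."""
    if not resources:
        return 0.0
    tagged = sum(1 for tags in resources if tags.get(tag_key, "").strip())
    return tagged / len(resources)


# Illustrative export only: one tag dictionary per resource.
exported = [
    {"team": "payments", "env": "prod"},
    {"team": "search"},
    {"env": "staging"},              # environment-level only, no team
    {"team": "platform"},
    {"team": "  "},                  # blank value does not count
]
coverage = team_tag_coverage(exported)
meets_bar = coverage > 0.80          # the low-waste accounts were above 80%
```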

Third, they had already implemented non-production scheduling and were running it consistently. The scheduling practice was established before they came to us, which meant the largest single savings category was already captured.

None of the 8 low-waste accounts used a best-in-class optimization platform — most were using a combination of native provider tools with manual review. The difference was organizational: assigned ownership, regular cadence, and years of accumulated incremental cleanup. The highest-waste accounts were not using worse tools. They had not institutionalized the review process.

What This Means for Prioritization

For an engineering team starting a cost optimization program with no previous work, the data from these 50 accounts suggests a clear starting sequence.

First, implement non-production scheduling for all identifiable development, staging, and QA environments. This captures approximately 28% of available savings, carries low implementation risk, and is achievable in 2-4 weeks with proper dependency analysis upfront.

Second, run the orphaned snapshot audit and delete confirmed orphans after classification review. This captures approximately 19% of available savings, carries zero production risk, and can be completed in a single day with proper tooling.

Third, review managed database configurations for unnecessary high-availability mode and zero-traffic read replicas. This captures approximately 14% of available savings, carries moderate implementation risk, and is achievable in 1-2 weeks per account with appropriate change management.

These three categories, executed in sequence, typically capture over 60% of the available savings in an account and build the organizational muscle for ongoing cost optimization. Compute right-sizing, commitment optimization, and architecture-level data transfer reduction can follow as the team's capacity and tooling mature. The order matters. Start with the categories where the analysis is unambiguous and the implementation is low-risk. Save the nuanced work for when the team has built confidence in the process.

Find out what your account looks like in 20 minutes

KernelRun analyzes all three primary waste categories within 20 minutes of connecting your first account. Average first-cycle savings identified: 34% of compute spend. Connect in 4 minutes.
