
Anatomy of a Cloud Bill Spike: Three Root Causes We See Repeatedly

When an engineering team notices their cloud bill jumped 40% month-over-month without a corresponding traffic increase, the first assumption is usually a misconfigured network gateway. Gateway data processing charges are notoriously opaque and do show up as unexpected line items. But in KernelRun's cost anomaly detection data across 50+ customer accounts, gateway surprises represent only about 18% of spike events. The other 82% fall into three categories that are less frequently discussed and harder to catch with standard budget alerts.

Understanding these three categories matters because the remediation approach differs for each one — and so does the correct detection method. A static budget threshold catches all three eventually, after the damage is done. A multi-variate baseline model with per-service anomaly detection catches them within hours.

Root Cause 1: Spot Capacity Interruption Cascades

Discounted spot-priced compute instances can run 60-90% cheaper than standard on-demand rates, and workloads that can tolerate interruption should be using them. The problem occurs when a spot fleet is interrupted and the auto-scaling group's fallback configuration switches to full-price instances without a corresponding notification or capacity review.

The typical pattern: a spot fleet covering 60% of a batch processing cluster gets interrupted during a capacity crunch. The auto-scaling group successfully replaces those interrupted instances with full-price equivalents, as designed. The workload continues without any user-visible impact. The engineering team sees no alert because the infrastructure behaved correctly from an availability standpoint. What they miss is that their on-demand footprint has jumped from 40% of the cluster to 100%, and the spot fleet may not recover to discounted capacity for days.

We've seen this pattern generate $18,000 in unexpected monthly charges for a team running a large data processing cluster. The full-price fallback ran for 11 days before anyone noticed, because the budget alert threshold was set at the monthly level and the overage only became visible near the end of the month.

The detection method that works: monitor full-price instance count per auto-scaling group in real time, not per billing cycle. A 24-hour period where the on-demand percentage for a historically spot-heavy group exceeds a threshold — we default to 30% deviation from the 30-day baseline — triggers an immediate alert. This catches the cascade within hours rather than weeks.
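A minimal sketch of that check, assuming hourly samples of instance counts per auto-scaling group are already being collected somewhere. The function names, the sample shape, and the reading of "30% deviation" as 30 percentage points are assumptions made for illustration, not KernelRun's implementation.

```python
from statistics import mean

def on_demand_share(on_demand: int, spot: int) -> float:
    """Fraction of a group's instances running at full price."""
    total = on_demand + spot
    return on_demand / total if total else 0.0

def fallback_alert(baseline_shares, recent_shares, max_deviation=0.30):
    """Return True when the recent on-demand share drifts too far above baseline.

    baseline_shares: on-demand shares sampled over the trailing 30 days
    recent_shares:   on-demand shares sampled over the last 24 hours
    max_deviation:   allowed gap in percentage points (assumed interpretation)
    """
    if not baseline_shares or not recent_shares:
        return False  # not enough data to judge
    return mean(recent_shares) - mean(baseline_shares) > max_deviation

# Hypothetical group: normally 40% on-demand, fully on-demand after a cascade.
baseline = [on_demand_share(40, 60)] * 30
recent = [on_demand_share(100, 0)] * 24
print(fallback_alert(baseline, recent))
```

Comparing shares rather than raw instance counts keeps the check stable when the group scales up or down for legitimate reasons.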

Root Cause 2: Snapshot Accumulation from Automated Backups

Persistent volume snapshots are incremental after the first full copy, but they are not free. For a database with 1TB of storage running a daily backup policy with 30-day retention, snapshot costs are predictable and budgeted. The problem occurs when retention policies are not enforced, instances are terminated without snapshot cleanup, or automated backup tools create additional snapshots outside the primary retention policy.

In practice, snapshot proliferation happens because multiple tools often manage backups for the same instance. A platform backup service runs its retention policy. The operations team has a manual snapshot script from three years ago that predates that service's adoption. A third-party DR tool also creates snapshots. Nobody owns the coordination between these three systems, and the snapshots accumulate.

One customer's account had 847 snapshots for database instances that had been terminated. No retention policy covered terminated-instance snapshots because the policies were instance-attached, and the instance was gone. The orphaned snapshots were accumulating at standard storage rates with no visibility in the cost dashboard because snapshot costs appear as a single storage line item, not broken out per snapshot or per source instance.

The detection approach is straightforward but requires dedicated analysis: enumerate all volume snapshots, cross-reference source instance IDs against currently running instances, and flag snapshots whose source no longer exists. For most accounts running more than two years, the first pass of this analysis identifies between $200 and $1,200/month in orphaned snapshot charges.
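The cross-reference step can be sketched as below, assuming the snapshot inventory and the set of live instance IDs have already been exported from the provider's API. The `Snapshot` record, its field names, and the storage rate in the example are hypothetical, and `size_gb` is assumed to be the billed size of each snapshot.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    source_instance_id: str
    size_gb: float  # assumed to be the billed (incremental) size

def find_orphans(snapshots: Iterable[Snapshot], live_instance_ids: Set[str]) -> List[Snapshot]:
    """Snapshots whose source instance is no longer running."""
    return [s for s in snapshots if s.source_instance_id not in live_instance_ids]

def monthly_cost(snapshots: Iterable[Snapshot], rate_per_gb_month: float) -> float:
    """Estimated monthly storage charge for a set of snapshots."""
    return sum(s.size_gb for s in snapshots) * rate_per_gb_month

# Hypothetical inventory: db-2 was terminated, its snapshots were left behind.
inventory = [
    Snapshot("snap-a", "db-1", 300.0),
    Snapshot("snap-b", "db-2", 120.0),
    Snapshot("snap-c", "db-2", 80.0),
]
orphans = find_orphans(inventory, live_instance_ids={"db-1"})
print([s.snapshot_id for s in orphans], monthly_cost(orphans, rate_per_gb_month=0.05))
```

Some orphaned snapshots are kept deliberately for compliance or DR, so the output is a review list rather than a delete list.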

Root Cause 3: Data Transfer Charges from Architecture Changes

Cloud data transfer pricing is structured to charge for traffic crossing availability zones, regions, and the public internet. Most engineering teams understand the high-level structure but miss the specific patterns that create large bills. The spike trigger is almost always an architecture change that inadvertently introduces cross-zone traffic where none existed before.

The most common variant: a team migrates an application from standard compute instances to containerized tasks. The new containers get assigned to availability zones based on provider capacity, not based on the zone placement of the services they communicate with. An application that previously ran co-located within the same zone now has containers in zone A making API calls to a managed database in zone B. Every request crosses a zone boundary at per-GB rates in each direction.

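For a high-throughput service making 10 million API calls per day averaging 5KB of response payload, the migration adds about 50GB per day (roughly 1.5TB per month) of cross-zone traffic. At a typical rate of $0.01/GB in each direction, that is only about $30/month, but the charge scales linearly with payload size: the same call volume at 150KB per response costs approximately $900/month that didn't exist before the migration. The application works perfectly. No alerts fire. The cost shows up as an uptick in data transfer charges with no obvious link to the container migration.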
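The arithmetic behind a cross-zone estimate can be sketched in a few lines. The rate of $0.01 per GB in each direction is an assumption used for illustration, as are the function name and the 30-day month. Check your provider's price list, since the result is very sensitive to the rate and to payload size.

```python
def cross_zone_monthly_cost(calls_per_day, payload_kb, rate_per_gb_each_way=0.01, days=30):
    """Estimate monthly cross-zone transfer cost for request/response traffic.

    Assumes every call crosses a zone boundary and that each GB is billed
    once on the sending side and once on the receiving side.
    """
    gb_per_month = calls_per_day * payload_kb * days / 1_000_000  # KB -> GB (decimal)
    return gb_per_month * rate_per_gb_each_way * 2

# Same call volume, different average payload sizes (hypothetical values).
for payload_kb in (5, 50, 150):
    cost = cross_zone_monthly_cost(10_000_000, payload_kb)
    print(f"{payload_kb:>4} KB/call -> ${cost:,.2f}/month")
```

Because the estimate is linear in both call volume and payload size, it is worth running before a migration with measured payload sizes rather than guesses.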

As we cover in our article on tag inference and cost data gaps, the challenge is that data transfer charges are attributed to the source service but not to the architectural decision that generated them. Correlating a billing spike with a specific deployment requires cross-referencing deployment timestamps with billing data, which most teams haven't set up as a standard practice.
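One way to set up that cross-reference is sketched here, assuming daily data transfer cost is available as a date-keyed mapping with one entry per consecutive day, and deployments are logged with a date and a name. The window size, the 25% jump threshold, the two-day lookback, and all names are illustrative assumptions.

```python
from datetime import date, timedelta

def spike_days(daily_cost, window=7, jump=0.25):
    """Days whose cost exceeds the mean of the preceding `window` days by `jump`."""
    days = sorted(daily_cost)
    flagged = []
    for i in range(window, len(days)):
        prior = [daily_cost[d] for d in days[i - window:i]]
        baseline = sum(prior) / window
        if baseline > 0 and daily_cost[days[i]] > baseline * (1 + jump):
            flagged.append(days[i])
    return flagged

def deployments_before(spike, deployments, lookback_days=2):
    """Names of deployments that shipped within `lookback_days` before the spike."""
    start = spike - timedelta(days=lookback_days)
    return [name for when, name in deployments if start <= when <= spike]

# Hypothetical data: cost steps from 100 to 160 per day on March 11.
start = date(2024, 3, 1)
cost = {start + timedelta(days=i): (100.0 if i < 10 else 160.0) for i in range(15)}
deployments = [
    (date(2024, 3, 2), "search-indexer config change"),
    (date(2024, 3, 10), "orders-api container migration"),
]
first_spike = spike_days(cost)[0]
print(first_spike, deployments_before(first_spike, deployments))
```

The output is a list of candidate deployments to review, not proof of causation. Billing data often lags by a day or more, so the lookback may need widening.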

Why Standard Budget Alerts Miss These Patterns

Platform budget alerts support threshold-based notifications: alert when spend exceeds X% of the monthly budget. This is a necessary baseline, but it has two structural limitations for the three root causes above.

First, monthly budget alerts operate on a lag. If a spot fleet cascade happens on day 3 of the month, the budget alert won't fire until the accumulated overage reaches the threshold, which for typical settings happens around day 20. More than two weeks of unexpected overspend has already occurred.
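The lag is easy to reproduce with a small simulation. This is a sketch with hypothetical numbers (a $30,000 budget, an alert at 80% of budget, $900/day of normal spend, and a cascade adding $400/day from day 3), not data from a customer account.

```python
def first_alert_day(daily_spend, monthly_budget, threshold=0.8):
    """Day of month on which cumulative spend first reaches the alert threshold."""
    running = 0.0
    for day, spend in enumerate(daily_spend, start=1):
        running += spend
        if running >= monthly_budget * threshold:
            return day
    return None  # threshold never reached this month

normal = [900] * 30                 # steady month
cascade = [900] * 2 + [1300] * 28   # fallback to full-price instances on day 3

print(first_alert_day(normal, 30_000))
print(first_alert_day(cascade, 30_000))
```

With these inputs the cumulative alert fires on day 20 for the cascade month, 17 days after the overspend began.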

Second, budget alerts operate on total spend, not on spend anomalies relative to a per-service baseline. A team whose cloud bill is growing 15% month-over-month due to legitimate business growth will set their threshold at 20% above forecast spend to avoid constant false alarms. An anomalous spike of 12% won't trigger the alert even though it represents $8,000 of unexpected spend from an identifiable root cause.

What Multi-Variate Baseline Detection Actually Looks Like

Cost anomaly detection with per-service baselines solves both problems. KernelRun's anomaly engine builds a multi-variate baseline per service using 90 days of history and flags deviations within 4 hours. The first flag is a warning (3-sigma deviation); the second is an alert requiring immediate investigation (5-sigma).
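To make the sigma thresholds concrete, here is a deliberately simplified sketch: a univariate z-score per service against its own history. It is not KernelRun's engine, which the text describes as multi-variate. The function name and sample data are hypothetical.

```python
from statistics import mean, stdev

def classify(history, latest, warn_sigma=3.0, alert_sigma=5.0):
    """Classify the latest daily cost against a service's own history.

    history: daily costs for one service (e.g. the trailing 90 days)
    Only upward deviations are flagged.
    """
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return "alert" if latest > mu else "ok"  # flat history: any rise is unusual
    z = (latest - mu) / sd
    if z >= alert_sigma:
        return "alert"
    if z >= warn_sigma:
        return "warning"
    return "ok"

# Hypothetical service: 90 days alternating between $95 and $105 per day.
history = [95.0, 105.0] * 45
for latest in (110.0, 118.0, 130.0):
    print(latest, classify(history, latest))
```

A plain z-score ignores weekly seasonality and growth trends, which is one reason a production baseline needs more than one variable.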

For teams that want to build detection capability without a dedicated platform, the minimum viable approach: configure separate cost monitors for compute, storage, and data transfer (not a single monitor for all services), and set alert thresholds per service, sized by dollar impact, rather than as a percentage of total spend. A 50% increase in data transfer costs is more actionable than the same dollars showing up as a 0.5% increase in total spend.
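The difference between the two views can be shown with a few lines of arithmetic. The service names and dollar amounts below are hypothetical, chosen so that data transfer is 1% of a $100,000 monthly bill.

```python
def pct_change(baseline, current):
    """Percentage change from baseline to current."""
    return (current - baseline) / baseline * 100

baseline = {"compute": 70_000, "storage": 29_000, "data_transfer": 1_000}
current = dict(baseline, data_transfer=1_500)  # only data transfer moved

per_service = {name: pct_change(baseline[name], current[name]) for name in baseline}
total = pct_change(sum(baseline.values()), sum(current.values()))

print(per_service)
print(f"total: {total:.2f}%")
```

The same $500 increase is a 50% jump on the data transfer monitor and a 0.5% move on the account total, which is why a single account-level monitor tends to miss it.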

The three root causes described here account for roughly 82% of the spike events we observe, and all three are preventable with the right detection infrastructure. The gap between "budget alert" and "anomaly detection" is where most teams are losing real money every month.
