Most rightsizing recommendations are wrong. Not because the analysis is bad, but because the baseline data feeding the analysis is inadequate. A sizing tool that looks at 14 days of CPU utilization and recommends downsizing to the p95 of that window will get you burned the first time a quarterly batch job or end-of-month reporting run hits the undersized instance.
A 90-day utilization baseline is the minimum viable window for making confident sizing decisions on production workloads. Here's how to build one that holds up, and what to do when your monitoring data doesn't go back that far.
Why 90 Days Specifically
The argument for 90 days is straightforward: it captures weekly periodicity (business days vs weekends), monthly periodicity (month-end processing, billing cycles, payroll runs), and the tail of quarterly patterns. A 30-day window might capture one end-of-month spike. A 90-day window captures three, which is enough to establish a reliable p99 utilization floor.
The argument against 90 days is that the workload may have changed significantly in the last quarter. Fair point for high-growth services. The resolution: segment the baseline. Look at the 90-day trend for the p95 utilization metric specifically, and compare the last 30 days to the preceding 60. If there's a meaningful upward trend in the recent 30 days, the 90-day p95 understates current demand, so weight the recent 30 days more heavily. If there's no trend, 90 days is your baseline.
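The segmentation check can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function names, the nearest-rank percentile, and the 10% threshold for what counts as a "meaningful" trend are all assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (pct in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def choose_baseline(cpu_by_day, trend_threshold=0.10):
    """Compare p95 of the last 30 days with p95 of the preceding 60.

    cpu_by_day: 90 per-day CPU figures in percent, oldest first
                (for example, each day's p95).
    trend_threshold: relative growth treated as meaningful (assumed 10%).
    Returns (window_label, relative_growth).
    """
    if len(cpu_by_day) != 90:
        raise ValueError("expected exactly 90 daily values")
    preceding_60, recent_30 = cpu_by_day[:60], cpu_by_day[60:]
    older_p95 = percentile(preceding_60, 95)
    recent_p95 = percentile(recent_30, 95)
    if older_p95 <= 0:
        raise ValueError("preceding window has no measurable load")
    growth = (recent_p95 - older_p95) / older_p95
    label = "recent-30-days" if growth > trend_threshold else "full-90-days"
    return label, growth


flat = [40.0] * 90
growing = [40.0] * 60 + [55.0] * 30
print(choose_baseline(flat))     # ('full-90-days', 0.0)
print(choose_baseline(growing))  # ('recent-30-days', 0.375)
```

The threshold is the judgment call here. A noisy service may need a higher one to avoid reacting to ordinary week-to-week variation.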
For non-production environments — development, staging, integration testing — 30 days is usually sufficient, because these environments don't have the same periodic business patterns. Rightsizing dev environments aggressively is low-risk and high-reward.
Which Metrics Actually Matter
Platform-native monitoring typically provides CPU utilization, network I/O, and disk I/O. For most compute instances, CPU is the primary sizing dimension. But "CPU utilization" is underspecified — you need to know which statistic you're looking at.
Average CPU is nearly useless for sizing. A workload that runs at 5% CPU average with 15-minute bursts to 95% will look identical in average utilization to a workload that runs at 5% CPU continuously. The first workload needs a bigger instance than the average suggests; the second is a candidate for downsizing.
The metrics that matter for compute sizing:
- p99 CPU utilization over the trailing 90 days: this is the demand figure you size against. Your target is for p99 to land below roughly 65 to 70% on the sized instance.
- Maximum CPU utilization over the trailing 90 days: this catches the absolute peak. You want the sized instance to handle this without triggering throttling.
- CPU credit balance (for burstable instance types): a chronically depleted credit balance is a signal that the instance is undersized for its actual burst pattern.
- Memory utilization: not provided natively by most cloud platforms for compute instances. It requires an agent, which is worth installing if memory is likely to be the constraining dimension.
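To see why the choice of statistic matters, here is a small sketch that summarizes two made-up days of one-minute CPU samples. The sample data and function names are illustrative assumptions, and the percentile uses the nearest-rank method. Both workloads average about 5%, but only one of them is a downsizing candidate.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile (pct in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the three statistics discussed above for one series."""
    return {
        "avg": round(mean(samples), 2),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


# Hypothetical one-minute samples for a single day (1,440 minutes).
steady = [5.0] * 1440                  # flat 5% all day
bursty = [2.0] * 1395 + [95.0] * 45    # three 15-minute bursts to 95%

print(summarize(steady))  # {'avg': 5.0, 'p99': 5.0, 'max': 5.0}
print(summarize(bursty))  # {'avg': 4.91, 'p99': 95.0, 'max': 95.0}
```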
For managed databases, add these:
- DatabaseConnections: smaller instance classes often come with lower connection limits, so downsizing can cause application errors under load if peak connections approach the new limit.
- ReadLatency and WriteLatency: latency spikes are often the first signal of an undersized database, appearing before CPU pressure becomes visible.
- FreeStorageSpace: relevant when considering storage class changes alongside instance sizing.
Building the Baseline Without 90 Days of History
Here's a common situation: you instrumented a new environment two weeks ago, or your monitoring retention is set to 30 days and you want to build a 90-day baseline without paying for extended retention. A few approaches work:
Synthetic extension using traffic proxies. If you have 90 days of application request logs (which most teams retain in object storage), you can reconstruct a proxy for utilization by correlating request volume with the CPU utilization you do have. If the last 30 days show that 1,000 requests/minute corresponds to 40% CPU, and 90-day request logs show a month-end peak of 3,200 requests/minute, you can infer that 3,200 rpm would have generated roughly 128% CPU. The current instance would have been saturated, so you need a larger instance tier than the 30-day data suggests.
This isn't perfect. The relationship between request volume and CPU isn't always linear, especially for database-heavy operations. But it's far better than extrapolating from 14 days of history.
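A minimal sketch of the proxy calculation, assuming a linear relationship with zero CPU at zero traffic (a least-squares fit through the origin). The function names and the sample pairs are hypothetical. Real data will scatter around the line, and for database-heavy paths a linear fit may not hold at all.

```python
def fit_cpu_per_request(pairs):
    """Least-squares slope through the origin for (rpm, cpu_pct) pairs."""
    numerator = sum(rpm * cpu for rpm, cpu in pairs)
    denominator = sum(rpm * rpm for rpm, _ in pairs)
    if denominator == 0:
        raise ValueError("need at least one sample with non-zero traffic")
    return numerator / denominator


def project_cpu(peak_rpm, cpu_per_request):
    """Projected CPU percent at a traffic level seen only in request logs."""
    return peak_rpm * cpu_per_request


# Hypothetical (requests/minute, CPU %) pairs from the 30 days of metrics
# you do have, consistent with 1,000 rpm corresponding to 40% CPU.
observed = [(500, 20.0), (1000, 40.0), (1500, 60.0)]

slope = fit_cpu_per_request(observed)
projected = project_cpu(3200, slope)
print(f"{projected:.0f}% CPU at 3,200 rpm")  # 128% CPU at 3,200 rpm
```

A projection above 100% is not a real utilization figure. It just means the current size would have been saturated at that traffic level.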
Conservative buffer application. If you have no proxy data, apply a conservative headroom buffer to the 30-day p99. For production workloads with unknown seasonal patterns, we recommend a 40% headroom buffer rather than the standard 20%. You're paying for ignorance, but you're not paying with downtime.
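As arithmetic, the buffer rule might look like the sketch below. It assumes "headroom buffer" means inflating the observed p99 by that fraction, which is one reasonable reading rather than the only one. The function name and flag are hypothetical.

```python
def buffered_demand(p99_cpu_pct, has_seasonal_history):
    """Inflate an observed p99 by a headroom buffer.

    Uses 20% when seasonal patterns are known and 40% when they are not.
    The result is in percent of the CURRENT instance's capacity.
    """
    buffer = 0.20 if has_seasonal_history else 0.40
    return p99_cpu_pct * (1 + buffer)


# A 30-day p99 of 50% with no seasonal history is treated as 70% demand.
print(round(buffered_demand(50.0, has_seasonal_history=False), 1))  # 70.0
print(round(buffered_demand(50.0, has_seasonal_history=True), 1))   # 60.0
```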
Staged downsizing with monitoring gates. Instead of committing to a new instance size, stage the change: run the candidate instance size in shadow mode if your architecture supports it, or resize during a low-traffic window with an automatic rollback trigger if p95 CPU exceeds 80% within 24 hours. This turns a risky one-shot decision into a monitored experiment.
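The rollback gate itself is a simple check. Here is a sketch assuming you can pull the post-resize CPU samples as a list of percentages. How you fetch them, and how you actually trigger the rollback, depends on your platform and is not shown. The function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (pct in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_roll_back(cpu_samples_24h, threshold_pct=80.0):
    """True if p95 CPU over the first 24 hours exceeds the threshold."""
    if not cpu_samples_24h:
        raise ValueError("no samples collected; cannot evaluate the gate")
    return percentile(cpu_samples_24h, 95) > threshold_pct


# Hypothetical five-minute samples (288 per day) after a resize.
healthy = [50.0] * 280 + [90.0] * 8     # brief spikes only: p95 stays at 50
strained = [50.0] * 260 + [90.0] * 28   # sustained pressure: p95 reaches 90

print(should_roll_back(healthy))   # False
print(should_roll_back(strained))  # True
```

Raising on an empty sample list is deliberate. A gate that silently passes when monitoring is broken is worse than no gate.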
What to Do With the Baseline Once You Have It
The baseline answers one question: what is the actual utilization range of this resource? The sizing decision requires two additional inputs: what headroom do you want, and what's the cost difference between current and candidate sizes?
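A reasonable standard for production rightsizing: target p99 CPU at around 65% of the sized instance. For most workloads, this translates to provisioning capacity equal to the p99 demand multiplied by 1.5, which puts p99 at about 67% of the new capacity. As a sanity check, a 2x spike over typical traffic (not over peak traffic) should not take the instance past 95%.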
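Here is one way to turn the 1.5x rule into a candidate size. The catalog is invented for illustration, and the sketch assumes CPU demand carries over linearly between sizes of the same instance family, which will not hold across CPU generations or architectures.

```python
# Hypothetical catalog, smallest first: (name, vCPUs). Not real instance types.
CATALOG = [("small-2", 2), ("medium-4", 4), ("large-8", 8), ("xlarge-16", 16)]


def pick_instance(p99_cpu_pct, current_vcpus, multiplier=1.5, catalog=CATALOG):
    """Smallest size whose vCPU count covers p99 demand times the multiplier.

    Returns (name, projected_p99_pct), or None if nothing is large enough.
    """
    demand_vcpus = p99_cpu_pct / 100 * current_vcpus
    required_vcpus = demand_vcpus * multiplier
    for name, vcpus in catalog:
        if vcpus >= required_vcpus:
            projected_p99 = demand_vcpus / vcpus * 100
            return name, projected_p99
    return None


# An 8-vCPU instance with p99 at 30% uses about 2.4 vCPUs at p99.
# 2.4 * 1.5 = 3.6, so the 4-vCPU size is the smallest that fits.
name, projected = pick_instance(30.0, current_vcpus=8)
print(name, round(projected))  # medium-4 60
```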
The final step is calculating the break-even on the analysis itself. If a rightsizing decision saves $80/month and took 4 hours of engineering time to research, that's a 2-month payback period at an engineering cost of around $40/hour, which is worth doing. If it saves $8/month, the same 4 hours take 20 months to pay back; the opportunity cost of the analysis exceeds the savings, and you should either automate the analysis or deprioritize it.
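The payback arithmetic fits in a one-line function. The 2-month figure in the example above works out if engineering time costs $40/hour. That rate is implied by the numbers rather than stated in the text, so treat it as an assumption and substitute your own loaded rate.

```python
def payback_months(analysis_hours, hourly_cost, monthly_savings):
    """Months of savings needed to recover the cost of the analysis."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return analysis_hours * hourly_cost / monthly_savings


HOURLY_COST = 40  # assumed rate in dollars, implied by the 2-month example

print(payback_months(4, HOURLY_COST, 80))  # 2.0
print(payback_months(4, HOURLY_COST, 8))   # 20.0
```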
KernelRun's utilization analysis builds 90-day baselines automatically as soon as you connect an account, surfaces the p95/p99 statistics alongside the current instance size and the cost of the next tier down, and pre-calculates the monthly savings for each rightsizing opportunity. The goal is to make the analysis cost essentially zero so you can evaluate every opportunity, not just the obvious ones.