Neglected servers and busy engineers

NISARG SHAH
Jan 28, 2024 · 10 min read

TL;DR: I saved about a hundred thousand dollars annually through resource optimization. I outline some of my learnings below. I place a high value on learning about the wide range of AWS offerings and on developing effective governance.

About a year ago, I had an interesting conversation with a DevOps engineer. He wanted to know when he could migrate some EC2 instances and an ElastiCache cluster to AWS Graviton. It was the most curious thing: I had never heard of that ElastiCache cluster, nor seen it in any design diagram.

After the meeting, I discovered that the cluster was costing us several thousand dollars annually but was barely being used: it stored about 5 MB of data. It had been designed to serve a huge number of requests, but the expected load never materialized, and the resource had been leaking money for months. We agreed to scale it down to the smallest ElastiCache SKU to limit costs until demand actually grew.

This incident made me wonder if we had more resources that had similar optimization opportunities. With help from colleagues and other leaders, I identified other areas on AWS that were “leaking money.” Here is a summary of some common optimization opportunities I found across several products and their AWS accounts:

Unused resources

We found unused resources everywhere. Every account had something that everyone had forgotten about. Maybe it was an OpenSearch cluster created during a research project, EC2 instances created for a pen-test environment, a set of databases spun up to test some upgrades, or a large instance someone created to test the performance of their application. People always forget things.

Bad memory comes with very high costs.

ElastiCache

The biggest mistakes I found with ElastiCache were related to size and high-availability configurations in non-prod environments. Most designers maintained identical prod and non-prod configurations. For example, if the prod cluster had 7 GB of RAM with 3 replicas, non-prod had the same configuration. But usually, the non-prod environments cached significantly less data and used the cluster less frequently. It was perfectly fine to reduce replicas down to zero in non-prod. We also decided to reduce the size of some non-prod clusters due to lower usage.
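
For illustration, here is a minimal boto3 sketch of the kind of change this involves. The replication group name and node type are placeholders, and clusters with automatic failover (Multi-AZ) enabled must keep at least one replica:

```python
import boto3

elasticache = boto3.client("elasticache")

# Drop all read replicas on a hypothetical non-prod Redis replication group.
elasticache.decrease_replica_count(
    ReplicationGroupId="orders-cache-nonprod",
    NewReplicaCount=0,
    ApplyImmediately=True,
)

# Optionally move to a smaller node type as well.
elasticache.modify_replication_group(
    ReplicationGroupId="orders-cache-nonprod",
    CacheNodeType="cache.t4g.micro",
    ApplyImmediately=True,
)
```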

EBS

I almost always found EC2 instances that someone had created temporarily to test something. Many people stop their instances once they are done, to stop incurring compute costs. But most forget that the EBS volumes remain active (and billable) even after the instances are stopped. Sometimes these instances had been inactive for years because the people who created them had left the team or the company. At other times, someone wanted to retain the data in the instances in the hope that it might be needed someday. So we either deleted the instances along with their volumes, or deleted them after taking EBS snapshots.
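
A rough sketch of that cleanup, assuming you want a snapshot as a safety net before deleting each unattached volume:

```python
import boto3

ec2 = boto3.client("ec2")

# Volumes not attached to any instance have the status "available".
volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]

for vol in volumes:
    # Keep a snapshot in case the data is ever needed again...
    snap = ec2.create_snapshot(
        VolumeId=vol["VolumeId"],
        Description=f"Archive of unattached volume {vol['VolumeId']}",
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
    # ...then stop paying for the provisioned volume itself.
    ec2.delete_volume(VolumeId=vol["VolumeId"])
```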

I found another subtle gotcha with EBS migrations from gp2 to gp3. The team that performed the migration always took EBS snapshots before migrating, but never removed them afterwards. So they saved about $0.02/GB-month by migrating to gp3, but spent an additional $0.05/GB-month on the lingering snapshots. Against gp2's roughly $0.10/GB-month, that means that while trying to save 20% of EBS costs, they effectively increased them by 30%. 🤯

It is often necessary to test your hypotheses even after making the changes. Always look at the bill after you think you have reduced costs.

Storage tiers

An interesting idea we explored at one point was moving EBS snapshots into the archive tier. But the fine print in the documentation says that standard-tier snapshots are incremental, while those in the archive tier are full, independent copies. We strongly suspected that moving our daily EBS snapshots into the archive tier would increase costs roughly 10x instead of reducing them 5x. We eventually decided to just schedule automatic deletion of some old snapshots.
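
A minimal sketch of that deletion job, assuming a 90-day retention window and that none of the old snapshots are still referenced by AMIs (deleting those would fail). Run on a schedule, this keeps the snapshot bill from quietly compounding; AWS Data Lifecycle Manager can achieve the same declaratively.

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # retention window is an assumption

paginator = ec2.get_paginator("describe_snapshots")
for page in paginator.paginate(OwnerIds=["self"]):  # only snapshots owned by this account
    for snap in page["Snapshots"]:
        if snap["StartTime"] < cutoff:
            ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
```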

RDS

It is very difficult to identify whether an RDS instance is right-sized. Databases usually try to use up as much RAM as they can. So, you would usually see very low CPU utilization while the RAM sits at 90%. This leads many to believe that reducing the size of the database would result in a massive performance impact. This is flawed logic for two reasons:

  1. RAM is primarily used in relational databases to cache data. So, RAM has less of an impact on performance compared to CPU or disk performance.
  2. Claims without measurements to back them are just guesses.

I always found it challenging to convince someone to scale down their database based on utilization metrics because of this RAM red herring. Still, I found some interesting cases. Once I found a DB with less than 10 GB of storage that was provisioned with 128 GB of RAM (costing tens of thousands of dollars). In a few other cases I found read replicas for which no application had even been given the connection strings! It is usually easy to validate such configuration issues through connection or session counts.
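
For example, here is a quick way to check whether anything is actually talking to a suspect replica; the instance identifier is a placeholder:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "orders-db-replica-1"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=14),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Maximum"],
)

# If the maximum connection count over two weeks is ~0, nothing is using the replica.
print(max((p["Maximum"] for p in stats["Datapoints"]), default=0))
```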

We also implemented scheduled shutdown of non-prod DB clusters during non-work hours, saving around 30–40% of their costs.
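
A sketch of that shutdown job, assuming non-prod clusters carry an env=nonprod tag and the script runs from a scheduled Lambda or cron job (note that Aurora automatically restarts stopped clusters after seven days):

```python
import boto3

rds = boto3.client("rds")

for cluster in rds.describe_db_clusters()["DBClusters"]:
    tags = {t["Key"]: t["Value"] for t in cluster.get("TagList", [])}
    if tags.get("env") == "nonprod" and cluster["Status"] == "available":
        # Stops compute billing; storage is still charged while stopped.
        rds.stop_db_cluster(DBClusterIdentifier=cluster["DBClusterIdentifier"])
```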

The simplest way to prove that something is sized correctly in the cloud is to scale it down a bit and look at health or performance data.

Aurora Serverless

During some of this work, I found a very useful feature in Aurora Serverless: it will helpfully pause the compute capacity if your DB remains idle for a short period. We found some services that were not being actively tested in non-production; as a result, their databases were not receiving any queries for weeks, except during regression testing. Simply letting AWS pause these resources saved thousands of dollars.
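
Enabling that behaviour on Aurora Serverless v1 is a one-call change; the cluster identifier and capacity values below are placeholders:

```python
import boto3

rds = boto3.client("rds")

rds.modify_db_cluster(
    DBClusterIdentifier="reports-db-nonprod",
    ScalingConfiguration={
        "MinCapacity": 2,
        "MaxCapacity": 8,
        "AutoPause": True,              # pause compute when the cluster is idle
        "SecondsUntilAutoPause": 1800,  # after 30 minutes without connections
    },
)
```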

Non-production resources don’t have to be active when they are not in use.

DynamoDB

I found that most DynamoDB tables were configured with 5 RCUs and WCUs, and many others with 20 of each. That struck me as odd: I expected read consumption to be much higher than write. I later found out that the 5-RCU tables had been created from the console without changing the defaults, and the 20-RCU tables had mostly been created with CloudFormation, where someone had likely copied the same template to create many tables.

The fix was simple. In production, we looked at the actual usage, set the minimum provisioned RCU/WCU to the average consumption, and allowed the tables to scale up to 10x in response to load. In non-production, we mostly switched tables to on-demand capacity, as they rarely had sustained usage.
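
Roughly, the two changes look like this in boto3 (table names and capacity numbers are placeholders, and the prod table still needs a target-tracking policy via put_scaling_policy to actually scale):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Non-prod: switch a rarely-used table to on-demand capacity.
dynamodb.update_table(
    TableName="orders-nonprod",
    BillingMode="PAY_PER_REQUEST",
)

# Prod: keep provisioned capacity, but register an auto-scaling range
# from the observed average up to ~10x headroom.
autoscaling = boto3.client("application-autoscaling")
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/orders",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=50,
    MaxCapacity=500,
)
```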

Application Load Balancers

Early on, I found some easy improvements with load balancers: a few had no rules, a few had no targets, and a few had no healthy targets because their instances were stopped. We promptly removed these load balancers.
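
Load balancers like these are easy to sweep for with the elbv2 APIs; a rough sketch:

```python
import boto3

elbv2 = boto3.client("elbv2")

for lb in elbv2.describe_load_balancers()["LoadBalancers"]:
    target_groups = elbv2.describe_target_groups(
        LoadBalancerArn=lb["LoadBalancerArn"]
    )["TargetGroups"]
    healthy = 0
    for tg in target_groups:
        health = elbv2.describe_target_health(TargetGroupArn=tg["TargetGroupArn"])
        healthy += sum(
            1 for t in health["TargetHealthDescriptions"]
            if t["TargetHealth"]["State"] == "healthy"
        )
    if not target_groups or healthy == 0:
        # No targets, or nothing healthy behind it: candidate for removal.
        print(lb["LoadBalancerName"])
```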

Quite some time later, I realized that running multiple application load balancers in one VPC was an anti-pattern. An application load balancer can route requests based on hostnames and paths, so it is trivial to put multiple APIs behind one load balancer. By provisioning a separate load balancer for each API, we were paying a flat fee of about $20/month per load balancer that was entirely avoidable. However, some of these ideas remained in the backlog due to the DevOps effort required for the configuration changes.
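
Consolidation mostly means adding listener rules on a shared ALB, along the lines of this sketch (the ARNs and path are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Route /orders/* on the shared listener to the orders API's target group.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:...:listener/app/shared-alb/...",
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/orders/*"]}],
    Actions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/orders-api/...",
    }],
)
```

Host-header conditions work the same way if you prefer one hostname per API.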

NAT gateways

NAT gateways are quite similar to application load balancers: they cost nearly the same, and they invite a similar anti-pattern. If there are multiple NAT gateways in one VPC (beyond what multi-AZ resilience requires), it is likely an anti-pattern. NAT gateways have a flat hourly fee (like load balancers) plus a fee based on data processed. They are scaled by AWS in response to traffic, so there is really no value in creating 5 NAT gateways for 5 applications inside one VPC.
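
A quick way to spot this is to count active gateways per VPC:

```python
import boto3
from collections import Counter

ec2 = boto3.client("ec2")

gateways = ec2.describe_nat_gateways()["NatGateways"]
per_vpc = Counter(gw["VpcId"] for gw in gateways if gw["State"] == "available")

for vpc_id, count in per_vpc.items():
    if count > 1:
        # More than one per VPC (beyond multi-AZ needs) is worth questioning.
        print(vpc_id, count)
```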

I found a neat trick with AWS Resource Access Manager, which allows sharing a VPC's subnets across accounts. You can then create a NAT gateway in one account and let resources in multiple accounts benefit from it, without each account provisioning its own. But again, most of these ideas remained in the backlog due to the networking changes required.

(Sometimes) Good ideas die in backlog.

Secrets Manager

Secrets Manager, as the name suggests, stores secrets. Some teams start using it to store all kinds of them: connection strings, API keys, VM credentials, encryption keys, private keys, and so on. That is not inherently a bad idea. But at some point, Secrets Manager accumulates thousands of these credentials, and over time some of them are no longer active or used, especially in lower environments. At $0.40 per secret per month, these costs can quickly reach tens of thousands of dollars.

The short-term fix is to remove unused secrets, which requires some manual auditing with the entire team; the last-accessed date in AWS sometimes helps with these decisions. The long-term fix depends on the nature of the secrets and the intended use of Secrets Manager. Often it is perfectly fine to store a credential in Parameter Store instead, since very few secrets are actually configured for rotation (with RDS or through a custom Lambda function). But such ideas tend to die in the backlog due to the extensive code changes required.
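
A sketch of that audit, assuming a 180-day staleness threshold; LastAccessedDate is tracked at day granularity, and a missing value usually means the secret has never been retrieved:

```python
import boto3
from datetime import datetime, timedelta, timezone

secretsmanager = boto3.client("secretsmanager")
cutoff = datetime.now(timezone.utc) - timedelta(days=180)  # threshold is an assumption

paginator = secretsmanager.get_paginator("list_secrets")
for page in paginator.paginate():
    for secret in page["SecretList"]:
        last_used = secret.get("LastAccessedDate")
        if last_used is None or last_used < cutoff:
            # Flag for review with the owning team rather than deleting outright.
            print(secret["Name"], last_used)
```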

The cost of a bad design choice (service selection) can be very high. Often, it becomes apparent only after your applications have scaled to the point where refactoring is a costly and risky endeavor.

CloudTrail

One of the most common recommendations about CloudTrail is to have an organization trail for centralized configuration and event tracking. But sometimes, accounts being onboarded into an organization already have their own trails configured. So, when they are brought into the AWS organization, they end up with two trails.

The pricing of CloudTrail is a bit unintuitive for the unfamiliar: the first copy of management events is free, but any additional trails incur charges. As a result, when an account is brought into the organization, it suddenly starts accruing CloudTrail costs. In the absence of active monitoring, these costs can rise to thousands of dollars annually.
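
Spotting the duplication is straightforward; this sketch flags account-local trails when an organization trail already covers the account:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

trails = cloudtrail.describe_trails(includeShadowTrails=True)["trailList"]
org_trails = [t for t in trails if t.get("IsOrganizationTrail")]
local_trails = [t for t in trails if not t.get("IsOrganizationTrail")]

if org_trails and local_trails:
    for trail in local_trails:
        # Likely redundant (and billable) alongside the organization trail.
        print(trail["Name"])
```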

Lack of monitoring can also cost you.

CloudWatch

One of the puzzles I wrestled with for a long time was the cost of the GetMetricData API in CloudWatch. I saw these costs in several accounts, and they tended to be $500–1,000 a month. I wondered what was actually pulling those metrics. Finally, with help from AWS support, I found that some of the AWS accounts had a New Relic integration. New Relic was helpfully pulling all available CloudWatch metrics so that its users wouldn't have to go back to AWS to see them. In most cases, no one on the team was aware of the integration, and we promptly turned it off.
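
If you need to confirm where CloudWatch spend is going, Cost Explorer can break it down by usage type. The service-name value and the hint that GetMetricData charges appear under usage types containing "GMD" are from memory, so verify them against your own bill; the dates are placeholders:

```python
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-11-01", "End": "2023-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["AmazonCloudWatch"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```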

New Relic (and similar tools) doesn't just incur licensing costs; it also incurs AWS costs when it pulls data from your accounts.

Trusted Advisor

I don't trust Trusted Advisor enough; it can mislead inexperienced engineers. It often provides useful hints about unused or overprovisioned resources, but it also misses a lot of signs of inefficient resource utilization. As a result, when it says you could save $30k by applying reservations, many are tempted to start there. It is often a trap: the resources proposed for reservations are frequently unused or not used sufficiently. This happens more often in non-production than in production, but it happens frequently enough that I always caution people against starting with reservations. Needless to say, I don't always succeed in convincing them.

Reservations should be the last step in cost savings, not the first.

The role of Governance

In some accounts, I found that costs decreased sharply after my efforts but rose gradually again after I moved on to other work. One-time cost reductions are valuable, but it is essential to build knowledge within the team and develop processes that prevent these mistakes.

A colleague shared an example that I will always remember:

I would need to fill in several forms and get approvals from multiple individuals to buy 10 pens. But I could spin up an EC2 instance that costs a thousand dollars with just a few clicks (or through IaC), and no one would notice.

I think the best decisions are made by the developers working on a given service. They can easily tell you whether their service needs 1 GB of RAM or 16 GB, because they often run it on their laptops. A step further removed, the ops team can still make useful decisions about whether something is being used at all, especially when those ops engineers don't sit within the development team or participate in the same agile ceremonies. In my view, it is essential to develop knowledge within (or in coordination with) the development team and make decisions closer to the actual feature development (rather than in Ops).

Cloud providers make it extremely easy for beginners to use their platforms. An inadvertent side effect is that it is equally easy to make (costly) mistakes. Leadership must develop governance and processes to prevent or catch these mistakes. I don't expect every IaC change to go through a committee of reviewers; rather, I emphasize investing in cloud skills and assigning accountability at the edge (or leaf) levels of the organization.

Summary

This was just a brief summary of many insights I found while working on cost savings. While not all applications use similar patterns and not everyone makes the same kind of mistakes, I hope you found my experience interesting and insightful.

I had loads of fun hunting for inefficiencies. I hope you also had some fun reading about it. If you have had similar experiences, I would be very excited to read about them.

Through 2023, I found ways to reduce our AWS costs by several hundred thousand dollars annually, but I actually succeeded in getting approval for and implementing about a hundred thousand of that. As I said, some ideas just die in backlogs.

After the AWS work was done, I moved to Azure and found some similar, but many different, insights. I will probably find a few more over the next few months, as I am in the middle of another cost-saving exercise in my current team. I will hopefully find time to write about it in the future. Let me know if you are curious, and I will prioritize writing about it.
