DevOps cloud cost optimization… is there such a thing? After all, if you’re concerned with your software’s development and operations, you want to make sure things work and work quickly. In dozens of companies we’ve spoken with, infrastructure cost is an afterthought.
Until it’s not.
Here’s what happens: someone in Finance, or the CTO, or the CIO, takes a look at the line-item expenses for DevOps, and realizes just how much of the budget is eaten up by cloud costs. All of a sudden, DevOps folks are facing top-down directives to reduce cloud costs, and need to find ways to do so without interrupting their “regular” work.
This is a common scenario. In 2016, enterprises spent $23B on public cloud IaaS services. By 2020, that figure is expected to reach $65B. Wasted spend makes up a quarter or more of that total, much of it in the form of services running when they don’t need to be, improperly sized infrastructure, orphaned resources, and shadow IT.
DevOps teams: this is a problem you can get in front of. In fact, you can even apply some of the core tenets of DevOps to reducing cloud waste, including holistic thinking, eliminating silos, rapid feedback, and automation.
Our Director of Cloud Solutions, Chris Parlette, heard these problems from cloud users and put together a presentation on a DevOps cloud cost optimization approach. Watch it on demand now and learn how you can get started: How to Eliminate Cloud Waste with a Holistic DevOps Strategy.
Today, we propose a new concept to add to the DevOps mindset: Continuous Cost Control.
In DevOps, speed and continuity are king. Continuous Operations, Continuous Delivery, Continuous Integration. Keep everything running and get new features in the hands of users quickly.
For some organizations, this approach leads to a mindset of “speed at any cost”. Especially in the era of easily consumable public cloud, this results in a habit of wasted spend and blown budgets – which may, of course, meet the delivery goals. But remember that a goal of Continuous Delivery is sustainability. That applies not only to the code and backend of the application, but also to the business side.
With that in mind, we get to the cost of development and operations. At some point in every organization’s lifecycle comes the need to control costs. Perhaps it’s when your system or product reaches a certain level of predictability or maturity – i.e. maintenance mode – or perhaps earlier, depending on your organization.
We all know that agility has helped companies create competitive advantage; but customers and others tell us it can’t be “agility at any cost.” That’s why we believe the next challenge is cost-effective agility. That’s what Continuous Cost Control is all about.
What is Continuous Cost Control?
Think of it as the ability to see and automatically take action on development and operations resources, so that the amount spent is a controlled factor and not merely a result. This should occur with no impact to delivery.
Think of the spend your department manages. It likely includes software license costs and true-ups, and perhaps various service costs. If you’re using private cloud/on-premise infrastructure, you’ve got equipment purchases and depreciation, plus everything needed to support that equipment, down to the fuel costs for backup generators, to consider.
However, the second biggest line item (after personnel) for many agile teams is public cloud. Within this bucket, consider the compute costs, bandwidth costs, database costs, storage, transactions… and the list goes on.
While private cloud/on-premise infrastructure requires continuous monitoring and cost control, the problem becomes acute when you change to the utility model of the public cloud. Now, more and more people in your organization have the ability to spin up virtual servers. It can be easy to forget that every hour (or minute, depending on the cloud provider) of this compute time costs money – not to mention all the surrounding costs.
Continually controlling these costs means automating your cost savings at all points in the development pipeline. Early in the process, development and test systems should only be run while actually in use. Later, during testing and staging, systems should be automatically turned on for specific tests, then shut down once the tests are complete. During maintenance and production support, make sure your metrics and logs keep you updated on what is being used – and when.
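The “turn it off when not in use” step above can be boiled down to a simple schedule check. Here’s a minimal sketch, assuming a hypothetical weekday/business-hours parking policy (the hours, days, and function name are all illustrative, not a prescription):

```python
from datetime import datetime

# Hypothetical "parking" schedule for non-production instances:
# run only during weekday business hours (all values are illustrative).
BUSINESS_HOURS = range(8, 18)   # 8:00-17:59 local time
WEEKDAYS = range(0, 5)          # Monday=0 .. Friday=4

def should_be_running(now: datetime) -> bool:
    """Return True if a dev/test instance should be up at this moment."""
    return now.weekday() in WEEKDAYS and now.hour in BUSINESS_HOURS

# A Saturday afternoon falls outside the schedule, so the instance parks.
print(should_be_running(datetime(2017, 6, 3, 14, 0)))  # False
```

In a real pipeline, an automation step would call a check like this (or a tool that embodies it) and start or stop the instance accordingly.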
How to get started with Continuous Cost Control
While Continuous Cost Control is an idea that you should apply to your development and operations practices throughout all project phases, there are a few things you can do to start a cultural behavior of controlled costs.
- Create a mindset. Apply the principles of DevOps to cloud cost control.
- Take a few “easy wins” to automate cost control on your public cloud resources:
  - Schedule your non-production resources to turn off when not needed
  - Build in a process to “right size” your instances, so you’re not paying for more capacity than you need
  - Use alternate services beyond basic compute where applicable. In AWS, for example, this includes Auto Scaling groups, Spot Instances, and Reserved Instances
- Integrate cost control into your continuous delivery process. The public cloud is a utility that needs to be optimized from day one – or if not then, as soon as possible:
  - Analyze your development team’s usage patterns to apply rational schedules to your systems and increase adoption rates
  - Allow deviations from the normal schedules, but make sure your systems revert to the schedule when possible
  - Be honest about what is actually being used, and don’t leave resources up just for convenience
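The “right size” step above can be reduced to a rule of thumb: if peak utilization stays well below capacity, step down a size. A minimal sketch, assuming a hypothetical size ladder and a 45% peak-CPU threshold (both are illustrative numbers, not a recommendation):

```python
# Illustrative instance-size ladder, smallest to largest (made-up names).
SIZES = ["small", "medium", "large", "xlarge"]

def rightsize(current: str, peak_cpu_percent: float) -> str:
    """Recommend one size down if peak CPU stayed under 45% of capacity."""
    idx = SIZES.index(current)
    if peak_cpu_percent < 45.0 and idx > 0:
        return SIZES[idx - 1]
    return current
```

For example, an “xlarge” instance that never exceeded 22% CPU would be stepped down to “large”; a busy instance stays put. Run the check periodically rather than once, so instances keep shrinking toward their actual needs.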
We hope this concept of Continuous Cost Control is useful to you and your organization – and we welcome your feedback.
DevOps cloud cost control: an oxymoron? If you’re in DevOps, you may not think that cloud cost is your concern. When asked what your primary concern is, you might say speed of delivery, or integrations, or automation. However, if you’re using public cloud, cost should be on your list of problems to control.
The Cloud Waste Problem
If DevOps is the biggest change in IT process in decades, then renting infrastructure on demand is the most disruptive change in IT operations. With the switch from traditional datacenters to public cloud, infrastructure is now used like a utility. Like any utility, there is waste. (Think: leaving the lights on or your air conditioner running when you’re not home.)
How big is the problem? In 2016, enterprises spent $23B on public cloud IaaS services. We estimate that about $6B of that was wasted on unneeded resources. The excess expense known as “cloud waste” comprises several interrelated problems: services running when they don’t need to be, improperly sized infrastructure, orphaned resources, and shadow IT.
Everyone who uses AWS, Azure, and Google Cloud Platform is either already feeling the pressure — or soon will be — to reel in this waste. As DevOps teams are primary cloud users in many companies, DevOps cloud cost control processes become a priority.
4 Principles of DevOps Cloud Cost Control
Let’s put this idea of cloud waste in the framework of some of the core principles of DevOps. Here are four key DevOps principles, applied to cloud cost control:
1. Holistic Thinking
In DevOps, you cannot simply focus on your own favorite corner of the world, or any one piece of a project in a vacuum. You must think about your environment as a whole.
For one thing, this means that, as mentioned above, cost does become your concern. Businesses have budgets. Technology teams have budgets. And, whether you care or not, that means DevOps has a budget it needs to stay within. Whether it’s a concern upfront or doesn’t become one until you’re approached by your CTO or CFO, at some point, infrastructure cost is going to be under scrutiny – and if you go too far out of budget, under direct mandates for reduction.
Solving problems not only speedily and elegantly, but also cost-efficiently, becomes a necessity. You can’t just be concerned about Dev and Ops; you need to think about BizDevOps.
Holistic thinking also means that you need to think about ways to solve problems outside of code… more on this below.
2. No Silos
The principle of “no silos” means not only no communication silos, but also, no silos of access. This applies to the problem of cloud cost control when it comes to issues like leaving compute instances running when they’re not needed. If only one person in your organization has the ability to turn instances on and off, then all responsibility to turn those instances off falls on his or her shoulders.
It also means that if you want to use an instance that is scheduled to be turned off… well, too bad. You either call the person with the keys to log in and turn your instance on, or you wait until it’s scheduled to come on. Or if you really need a test environment now, you spin up new instances – completely defeating the purpose of turning the original instances off.
The solution is eliminating the control silo by allowing users to access their own instances to turn them on when they need them and off when they don’t — of course, using governance via user roles and policies to ensure that cost control tactics remain uninhibited.
(In this case, we’re thinking of providing access to outside management tools like the one we provide, but this can apply to your public cloud accounts and other development infrastructure management portals as well.)
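A governance layer like this amounts to a role-to-permissions mapping. Here’s a minimal sketch, assuming made-up role names and actions (any real tool or cloud IAM system would define its own):

```python
# Minimal sketch of role-based governance for instance start/stop.
# Roles, actions, and the mapping are invented for illustration.
PERMISSIONS = {
    "admin":     {"start", "stop", "change_schedule"},
    "developer": {"start", "stop"},   # can override their own schedule
    "viewer":    set(),               # read-only
}

def allowed(role: str, action: str) -> bool:
    """Check whether a role may perform an action on a parked instance."""
    return action in PERMISSIONS.get(role, set())
```

The point of the design: developers can start a parked instance when they need it (no waiting on the person with the keys), but only admins can change the schedule itself, so the cost-control policy stays intact.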
3. Rapid, Useful Feedback
In the case of eliminating cloud waste, the feedback you need is where, in fact, waste is occurring. Are your instances sized properly? Are they running when they don’t need to be? Are there orphaned resources chugging away, eating at your budget?
Useful feedback can also come in the form of total cost savings, percentages of time your instances were shut down over the past month, and overall coverage of your cost optimization efforts. Reporting on what is working for your environment helps you decide how to continually address the problem that you are working on next.
You need monitoring tools in place in order to discover the answers to these questions. Preferably, you should be able to see all of your resources in a single dashboard, to ensure that none of these budget-eaters slip through the cracks. Multi-cloud and multi-region environments make this even more important.
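At its core, that kind of feedback is just a query over a resource inventory. A sketch, using an invented inventory format (a real report would pull this data from each cloud provider’s API across every region and account):

```python
# Sketch of a cross-region waste report over an inventory snapshot.
# The inventory schema here is illustrative, not any provider's API.
inventory = [
    {"id": "i-1", "region": "us-east-1", "state": "running", "env": "dev"},
    {"id": "i-2", "region": "us-west-2", "state": "running", "env": "prod"},
    {"id": "i-3", "region": "eu-west-1", "state": "stopped", "env": "test"},
]

def waste_candidates(resources):
    """Flag non-production resources that are still running."""
    return [r["id"] for r in resources
            if r["state"] == "running" and r["env"] != "prod"]
```

Against the sample inventory, only the running dev instance is flagged; the production instance and the already-stopped test instance are left alone. That flagged list is exactly the rapid feedback you want landing in front of the team.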
4. Automation
The principle of Automation means that you should not waste time creating solutions when you don’t have to. This relates back to solving problems outside of code, mentioned above.
Also, when “whipping up a quick script”, always remember the time cost of maintaining such a solution – scripting isn’t always the answer.
So when automating, keep your eyes open and do your research. If there’s already an existing tool that does what you’re trying to code, it could be a potential time-saver and process-simplifier.
So take a look at your DevOps processes today, and see how you can incorporate a DevOps cloud cost control – or perhaps, “continuous cost control” – mindset to help with your continuous integration and continuous delivery pipelines. Automate cost control to reduce your cloud expenses and make your life easier.
“Is that old cloud instance running?”
Perhaps you’ve heard this around the office. It shouldn’t be too surprising: anyone who’s ever tried to load the Amazon EC2 console has quickly found how difficult it is to keep a handle on everything that is running. Only one region gets displayed at a time, which makes it common for admins to be surprised when the bill comes at the end of the month. In today’s distributed world, it not only makes sense for different instances to be running in different geographical regions, but it’s encouraged from an availability perspective.
On top of this multi-region setup, many organizations are moving to a multi-cloud strategy as well. Many executives are stressing to their operations teams that it’s important to run systems in both Azure and AWS. This provides extreme levels of reliability, but also complicates the day-to-day management of cloud instances.
So is that old cloud instance running?
You may get a chuckle out of the idea that IT administrators can lose servers, but it happens more frequently than we like to admit. If you only ever log in to us-east-1, then you might forget that your dev team in San Francisco was using us-west-2 as their main development environment. Or perhaps you set up a second cloud environment to make sure your apps all work properly, but forgot to shut it down before going back to your main cloud.
That’s where a single-view dashboard (like the view you get with ParkMyCloud) can provide administrators with unprecedented visibility into their cloud accounts. This is a huge benefit that leads to cost savings right off the bat, as the cloud servers running that you forgot about or thought you turned off can be seen in a single pane of glass. Knowledge is power: now that you know it exists, you can turn it off. You also get an easy view into how your environment changes over time, so you’ll be aware if instances get spun up in various regions.
This level of visibility also has a freeing effect, as it can lead you to utilizing more regions without fear of losing instances. Many folks know they should be distributed geographically, but don’t want to deal with the headache of keeping track of the sprawl. By tracking all of your regions and accounts in one easy-to-use view, you can start to fully benefit from cloud computing without wasting money on unused resources.
Now with ParkMyCloud’s core functionality available for free, it’s easy to get this single view of your AWS and Azure environments. We think you’ll get a new perspective on your existing cloud infrastructure – and maybe you’ll find a few lost servers! Get started with the free version of ParkMyCloud.
Waste not, want not. That was one of the well-known quips of one of the United States’ Founding Fathers, Benjamin Franklin. It couldn’t be more timely advice in today’s cloud computing world – the world of cloud waste. (When he was experimenting with static electricity and lightning, I wonder if he saw the future of the Cloud? :^) )
Organizations are moving to the Cloud in droves. And why not? The shift from CapEx to monthly OpEx, the elasticity, the reduced deployment times and faster time-to-market: what’s not to love?
The good news: the public cloud providers have made it easy to deploy their services. The bad news: the public cloud providers have made it easy to deploy their services…really easy.
And, experience over the past decade has shown that leads to cloud waste. What is “cloud waste” and where does it come from? What are the consequences? What can you do to reduce it?
What is Cloud Waste?
“Cloud waste” occurs when you consume more cloud resources than you actually need to run your business.
It takes several forms:
- Resources left running 24×7 in development, test, demo, and training environments where they don’t need to be running 24×7. (Think of parents yelling at children to “turn the lights out” if they’re the last one leaving a room.) I believe this is a bad habit reinforced by the previous era of on-premise data centers. The thinking: it’s a sunk cost anyway, why bother turning it off? Of course, it’s not a sunk cost anymore.
This manifests itself in various ways:
- Instances or VMs which are left running, chewing up $/CPU-Hr costs and network charges
- Orphaned volumes (volumes not attached to any servers), which are not being used and incurring monthly $/GB charges
- Old snapshots of those or other volumes
- Old, out-of-date machine images
However, cloud consumers are not the only ones to blame. The public cloud providers are also responsible when it comes to their PaaS (platform as a service) offerings for which there is no OFF switch (e.g., AWS’ RDS, Redshift, DynamoDB and others). If you deliver a PaaS offering, make sure it has an OFF switch.
- Resources that are larger than needed to do the job. Many developers don’t know what size instance to spin up to do their development work, so they will often spin up larger ones. (Hey, if 1 core and 4 GB of RAM is good, then 16 cores and 64 GB of RAM must be even better, right?) I think this habit also arose in the previous era of on-premise data centers: “We already paid for all this capacity anyway, so why not use it?” (Wrong again.)
This, too, rears its ugly head in several ways:
- Instances or VMs which are much larger than they need to be
- Block volumes which are larger than they need to be
- Databases which are heavily over-provisioned relative to their actual IOPS or sequential throughput requirements.
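Most of these waste forms can be found with a simple scan of your resource inventory. For instance, here’s a minimal sketch of an orphaned-volume check, using invented field names rather than any provider’s actual API schema:

```python
# Sketch: find orphaned block volumes (volumes attached to no server).
# The volume records and field names are illustrative only.
volumes = [
    {"id": "vol-1", "attached_to": "i-1", "size_gb": 100},
    {"id": "vol-2", "attached_to": None,  "size_gb": 500},
    {"id": "vol-3", "attached_to": None,  "size_gb": 50},
]

def orphaned(vols):
    """Return (ids, total GB) of volumes not attached to any server."""
    orphans = [v for v in vols if v["attached_to"] is None]
    return [v["id"] for v in orphans], sum(v["size_gb"] for v in orphans)
```

In the sample data, two detached volumes totaling 550 GB are quietly incurring $/GB-month charges – exactly the kind of line item nobody notices until the bill arrives.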
Who is Affected by Cloud Waste?
The consequences of cloud waste are quite apparent: it hurts everyone’s bottom line. For consumers, it erodes return on assets, return on equity, and net revenue – all of which ultimately impact earnings per share for their investors as well.
Believe it or not, it also hurts the public cloud providers and their bottom line. Public cloud providers are most profitable when they can oversubscribe their data centers. Cloud waste forces them to build more of these very expensive data centers than they need, hurting their oversubscription rates and their profitability as well. This is why you see cloud providers offering certain cost-cutting options. For example, AWS offers Reserved Instances, where you pay up front for a break on on-demand pricing, as well as Spot Instances, Auto Scaling groups, and Lambda. Azure offers price breaks to ELA customers, plus Scale Sets (the equivalent of Auto Scaling groups).
How to Prevent Cloud Waste
So, what can you do to address this? Ultimately, the solution to this problem exists between your ears. Most of it is common sense: It requires rethinking… rewiring your brain to look at cloud computing in a different way. We all need to become honorary Scotsmen (short arms and deep pockets… with apologies to my Scottish friends).
- When you turn on resources in non-production environments, turn on the minimum size needed to get the job done and only grudgingly move up to the next size.
- Turn stuff off in non-production environments when you are not using it. And for Pete’s sake, don’t waste time and money writing your own scripts for this – that just exacerbates the waste. Your DevOps people should spend that time on your bread-and-butter applications. Use ParkMyCloud instead! (Okay, yes, that was a shameless plug, but it is true.)
- Clean up old volumes, snapshots and machine images.
- Buy Reserved Instances for your production environments, but make sure you manage them closely, so that they actually match what your users are provisioning, otherwise you could be double paying.
- Investigate Spot fleets for your production batch workloads that run at night. It could save you a bundle.
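The payoff from the “turn stuff off” habit is easy to estimate. A non-production instance parked nights and weekends runs 12 hours a day, 5 days a week – 60 of the 168 hours in a week – for a savings of roughly 64% on that instance’s compute bill. A quick sketch of the arithmetic (the $0.10/hour rate is illustrative):

```python
HOURS_PER_WEEK = 24 * 7          # 168

def weekly_savings(hourly_rate: float, hours_on: float) -> float:
    """Dollars saved per week by parking an instance outside `hours_on`."""
    return hourly_rate * (HOURS_PER_WEEK - hours_on)

# 12 hours/day x 5 days/week = 60 hours on, 108 hours parked
print(round(weekly_savings(0.10, 12 * 5), 2))  # 10.8
```

Multiply that across dozens of dev, test, demo, and training instances and the weekly savings become a meaningful line item.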
These good habits, over time, can benefit everyone economically: Cloud consumers and cloud producers alike.
Traditional IT companies may dominate in a few fields, but in others, they will never catch up to those companies “born in the cloud.”
I actually have a unique perspective on these two worlds: prior to this adventure at ParkMyCloud, I worked at IBM for many years. I was originally with Micromuse, where we had a fault and service assurance solution (Netcool) to manage and optimize network and IT operations. Micromuse was acquired by IBM in 2006 and folded into the Tivoli Software Group business unit (later renamed Smarter Cloud). IBM was great – I learned a lot and met a lot of very bright people. I was in Worldwide Sales Management, so I had global visibility into IT trends.
In the 2012/2013 timeframe, I noticed we were losing a lot of IT management, monitoring and assurance deals to companies like ServiceNow, New Relic, Splunk, Microsoft, and the like – all these “born in cloud” companies offering SaaS-based solutions to solve complex enterprise problems (that is, “born in the cloud” other than Microsoft – I’ll come back to them).
At first these SaaS-based IT infrastructure management companies were managing traditional on-premise servers and networks, but as more and more companies moved their infrastructure into the cloud, the SaaS companies were positioned to manage that as well – and at IBM, we were not. All of a sudden we were trying to sell complex, expensive IT management solutions for stuff running in this “cloud” called Amazon Web Services (AWS) – a mere 5 years ago. And then SoftLayer, Rackspace, and Microsoft Azure popped up. I started thinking: there must be something here, but what is it, and who’s going to manage and optimize this infrastructure?
After a few years sitting on the SaaS side of the table, now I know. Many meetings and discussions with very large Fortune 100 enterprises have taught me several very salient points about the cloud:
- Public cloud is here to stay – see Capital One or McDonald’s at recent AWS re:Invent Keynotes (both customers of ParkMyCloud, by the way)
- Enterprises are NOT using “traditional” IT tools to build, test, run and manage infrastructure and applications in the cloud
- What’s different about the cloud is that it’s a YUGE utility, which means companies now focus on cost control. Since it’s an OpEx model rather than a CapEx model, they want to continually optimize their spend
Agility and innovation drive public cloud adoption but as cloud maturity grows so does the need for optimization – governance, cost control, and analytics.
So where does this leave the traditional companies like Oracle, HPE, and IBM? How are they involved in the migration to and lifecycle management of cloud-based applications? Well, from what I have seen, they are on the outside looking in – which is why, when my good friend sent this to me the other day, I was shocked. I guess Oracle decided to spot AWS a $13B lead – pretty smart; I am sure they will make that gap up by, oh, let’s say 2052… brilliant strategy.
That said, one company that “gets it” seems to be Microsoft, both in terms of providing cloud infrastructure (Azure) but also being progressive enough to license their technologies for even the smallest of companies to adopt and grow using their applications.
To put a bow on this point, I was at a recent meeting where a Fortune 25 company was talking to us about their migration into the cloud, and the tools they are using:
- Clouds – AWS / Azure
- Migration – service partner
- Monitoring – DataDog
- Service Desk and CMDB – ServiceNow
- Application Management – NewRelic
- Log analytics – Splunk
- Pipeline automation – Jenkins
- Cost control (yes, that’s a category now) – ParkMyCloud
Now that’s some pretty good company! And not a single “traditional” IT tool on the list. I guess it takes one born in the cloud to manage it.