It sounds obvious once you say it: you can scale AWS Auto Scaling Groups (ASGs) down to zero. This is a cost-savings measure: zero servers means zero cost. Yet most people don’t do it!
Wait – Why Would You Want to?
Maybe you’ve heard the DevOps saying: servers should be cattle, not pets. The idea is that no single server should be indispensable – a special “pet”. Instead, servers should be completely replaceable and important only as a herd, like cattle. One way to adhere to this principle is to create all servers in groups.
Some of our customers follow this principle: they use Auto Scaling Groups for everything. When they create a new app, they create a new ASG – even if it has a size of one server. This removes obstacles to scaling up in the future. However, it leaves these users with built-in wasted spend.
Here’s a common scenario: a production environment is built with Auto Scaling Groups of EC2 instances and RDS databases. A developer or QA specialist copies production to a testing or staging environment, and soon enough there are three or four environments of ASGs, with huge servers and databases mimicking production, all running and costing money while no one is using them.
By setting an on/off schedule on your Auto Scaling Groups, you can set their min/max/desired instance counts to 0 overnight, on weekends, or whenever else these groups are not needed.
In essence, this is just like parking a single EC2 instance when not in use. Even for a single EC2 instance, users are unlikely to go into the AWS console at the end of a workday to turn off their non-production servers overnight. For ASGs, it’s even less likely: where stopping an EC2 instance takes a single right-click, an ASG requires you to go into the group’s settings, edit them, and modify the min/max/desired number of instances – and then remember to do the opposite when you need to turn the group back on.
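To make this concrete, here is a minimal sketch of what an automated on/off schedule could look like, assuming a boto3-style Auto Scaling client. The working hours, group name, and daytime capacity below are illustrative assumptions, not values from this article:

```python
from datetime import datetime

WORK_START, WORK_END = 8, 18  # illustrative working hours (local time)

def scheduled_size(now: datetime, daytime_size: int = 2) -> int:
    """Return 0 on weekends and overnight, else the normal instance count."""
    if now.weekday() >= 5:                       # Saturday or Sunday
        return 0
    if not (WORK_START <= now.hour < WORK_END):  # outside working hours
        return 0
    return daytime_size

def apply_schedule(asg_client, asg_name: str, now: datetime) -> int:
    """Set min/max/desired on the ASG from the schedule; returns the size applied."""
    size = scheduled_size(now)
    # update_auto_scaling_group is the boto3 operation for changing these values
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=size,
        MaxSize=size,
        DesiredCapacity=size,
    )
    return size
```

In practice you would run this from a scheduler (cron, a Lambda function, or a tool like ParkMyCloud) and restore the group’s original sizes rather than a fixed daytime value.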
How You Can “Scale to Zero” in Practice
One ParkMyCloud customer, Workfront, is using this daily to keep costs in check. Here’s how Rob Weaver described it in a recent interview with Corey Quinn:
Scaling environments are a perfect example. If we left scaling up the entire time – 24/7 – it would cost as much as a production instance. It’s a full set of databases, application servers, everything. For that one, we’ve got it set so the QA engineers push a button [in ParkMyCloud], they start it up. For a certain amount of time before it shuts back down.
In other cases, we’ve got people who go in and use the [ParkMyCloud] UI, push the little toggle that says “turn this on”, choose how long to turn it on, and they’re done.
How else does Workfront apply ParkMyCloud’s automation to reduce costs for a 5X ROI? Find out here.
Another Fun Fact About AWS ASGs
One gripe some users have about Auto Scaling Groups is that they terminate resources when scaling down (one could argue that those users are pro-pet, anti-cattle, but I digress). If your needs require servers in AWS ASGs to be temporarily stopped instead of terminated, ParkMyCloud can do that too, with the “Suspend ASG Processes” option when parking a scale group. This will suspend the automation of an ASG and stop the servers without terminating them, and reverse this process when the ASG is being “unparked”.
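Under the hood, the park/unpark sequence looks roughly like the following sketch. The clients are passed in to keep this illustrative; with real boto3 these would be the autoscaling and ec2 clients, whose suspend_processes/resume_processes and stop_instances/start_instances operations do exist, though ParkMyCloud’s actual implementation is not shown here:

```python
def park_asg(asg_client, ec2_client, asg_name: str, instance_ids: list) -> None:
    """Suspend the ASG's automation, then stop (not terminate) its instances."""
    # With scaling processes suspended, the ASG will not replace stopped instances.
    asg_client.suspend_processes(AutoScalingGroupName=asg_name)
    ec2_client.stop_instances(InstanceIds=instance_ids)

def unpark_asg(asg_client, ec2_client, asg_name: str, instance_ids: list) -> None:
    """Reverse the parking: start the instances, then resume ASG automation."""
    ec2_client.start_instances(InstanceIds=instance_ids)
    asg_client.resume_processes(AutoScalingGroupName=asg_name)
```

The ordering is the important part: suspend before stopping (so the ASG doesn’t treat stopped instances as unhealthy and replace them), and resume only after the instances are running again.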
Try both scaling to zero and suspending ASGs – start a free trial of ParkMyCloud to try it out.
In July, Microsoft introduced the Azure Well-Architected Framework – a guide for building and delivering solutions using Azure’s best practices. If you’ve ever seen the AWS Well-Architected Framework, Azure’s will look… familiar. It bears many similarities to the Google Cloud Architecture Framework as well, which was released in May. This is perhaps a sign that despite the frequently argued differences between the cloud providers (and people love to compare – by far the most-read post on this blog is this one on AWS vs. Azure vs. Google Cloud market share), they are more similar than different. Is this a bad thing? We would argue, no.
There are many aspects of a well-designed architecture and these frameworks to discuss. Given ParkMyCloud’s focus on cost here, we’ll examine the cost optimization principles in Azure’s framework and how they compare to AWS and Google’s.
Architecture Guidelines at a High Level
The three cloud providers each provide architecture frameworks with similar sets of principles. AWS and Azure use the “pillar” metaphor, and in fact, the pillars are almost identically named: both frameworks are built on Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security.
While at first it is somewhat amusing to note these similarities (did Azure just ctrl+C?), it is reassuring that the major cloud providers all agree on what components comprise a well-designed architecture. Better yet, they are providing ever-improving resources, training, assessments, and support for their users to learn and apply these best practices.
Who Should Use the Azure Well-Architected Framework – and How to Get Started
Speaking of users – which ones are these architecture frameworks for? In their announcement, Azure noted the shifting of responsibility of security, operations, and cost management from centralized teams toward the workload owner. While the truth of this statement will depend on the organization, we have recognized this shift as well.
So while Azure’s framework is aimed largely at new Azure users and/or new applications, we would recommend every Azure user skim the table of contents and take the well-architected review assessment. The assessment takes the form of a multiple-choice quiz. At the end, you are given a score on a scale from 1 to 100, along with links to next steps and detailed articles for each question where there is room for improvement. The assessment is worth the time (and won’t take much of it), giving you a straightforward action plan.
The architecture resources provided by Google Cloud are much briefer than AWS and Azure’s frameworks, and they combine performance and cost optimization into one principle, so it’s not surprising several topics are missing – including any discussion of governance or ownership of cost. AWS focuses on this the most, particularly with the new section on cloud financial management, but Azure certainly also discusses organizational structure, governance, centralization, tagging, and policies. We appreciate the stages of cost optimization Azure uses, from design, to provisioning, to monitoring, to optimizing.
All three cloud providers have similar recommendations in cost optimization regarding scalable design, using tagging for cost governance and visibility, using the most efficient resource cost models, and rightsizing.
Azure puts it this way: cost is important, but you should seek to achieve balance between all the pillars. Shoring up any of the other pillars will almost always increase costs. Invest in security first, then performance, then reliability. Operational excellence can increase or decrease costs. Cost optimization will always be important for any organization in public cloud, but it does not stand alone.
Orphaned volumes and snapshots exacerbate the problem of wasted cloud spend. We have previously estimated that $17.6 billion will be wasted this year on idle and oversized resources in public cloud. Today we’re going to dive into so-called orphaned resources.
A resource can become “orphaned” when it is detached from the infrastructure it was created to support, such as a volume detached from an instance or a snapshot detached from any volumes. Whether you are aware these remain in your cloud environment or not, they can continue to incur costs, wasting money and driving up your cloud bill.
How Resources Become Detached
One form of orphaned resource comes from storage. Volumes or disks, such as Amazon EBS volumes, are attached to an EC2 instance when created. You can attach multiple volumes to a single instance to add storage space. If an instance is terminated but the volumes attached to it are not deleted, “orphaned volumes” have been created. Note that by default, the boot disk attached to every instance is set to be deleted when the instance is terminated (although it is possible to deselect this option), but any additional disks that have been attached do not necessarily follow this behavior.
Snapshots can also become orphaned resources. A snapshot is a point-in-time image of a volume. In the case of Amazon EBS, AWS snapshots are stored in Amazon S3. EBS snapshots are incremental, meaning only the blocks on the device that have changed since your most recent snapshot are saved. If the associated instance and volume are deleted, a snapshot can be considered orphaned.
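As a sketch of how such orphans can be spotted, the following assumes describe_volumes-style data from the EC2 API (the field names match the real API response; treating “unattached” and “untagged” as orphan signals is our own heuristic, not an AWS rule):

```python
def orphaned_volume_ids(volumes):
    """EBS volumes in the 'available' state are attached to no instance."""
    return [v["VolumeId"] for v in volumes
            if v.get("State") == "available" and not v.get("Attachments")]

def untagged(resources):
    """Resources with no tags are hard to identify once the parent is gone."""
    return [r for r in resources if not r.get("Tags")]
```

Fed with the output of an EC2 describe call, this flags candidates for review – not for automatic deletion, for reasons covered below.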
Are All Detached Resources Unnecessary?
Just because a resource is detached does not mean it should be deleted. For example, you may want to keep:
The most recent snapshots backing up a volume
Machine images used to create other machines
Snapshots used to inexpensively store the state of a machine you intend to use later, rather than keeping a volume around
However, like the brownish substance in the Tupperware at the back of your freezer, anything you want to keep needs to be clearly labeled in order to be useful. By default, snapshots and volumes do not always get automatically tagged with enough information to know what they actually are. At ParkMyCloud, we see exabytes of untagged storage in our customers’ environments, with no way of knowing whether it is safe to delete. In the AWS console, metadata is not cleanly propagated from the parent instance, and you have to go out of your way to tag snapshots before the parent instances are terminated. Once the parent instance is terminated, it can be impossible to identify the source of an orphaned volume or snapshot without re-attaching it to a running instance and inspecting the data. Tag early and tag often!
The Size of Wasted Spend
To estimate the size of the problem of orphaned volumes and snapshots, we’ll start with some data from ParkMyCloud customers in aggregate. ParkMyCloud customers spend approximately 15% of their bills on storage. We found that 35% of that storage spend was on unattached volumes and snapshots. As detailed above, while this doesn’t mean that all of that is wasted, the lack of tagging and excess of snapshots of individual volumes indicates that much of it is.
Overall, an average of 5.25% of our customers’ bills is being spent on unattached volumes and snapshots. Applying that percentage to the $50 billion estimated to be spent on Infrastructure as a Service (IaaS) this year gives a maximum of about $2.6 billion wasted this year on orphaned volumes and snapshots. This is a huge problem.
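The arithmetic behind that estimate, spelled out with the figures cited above:

```python
storage_share = 0.15        # share of customer bills spent on storage
unattached_share = 0.35     # of that storage spend, share on unattached volumes/snapshots
iaas_spend_billions = 50.0  # estimated IaaS spend this year, in $B

bill_share = storage_share * unattached_share      # 0.0525, i.e. 5.25% of the total bill
waste_billions = bill_share * iaas_spend_billions  # ~$2.6B upper bound on this waste
```

This is an upper bound because, as noted above, not every unattached volume or snapshot is actually waste.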
Based on the size of this waste and customer demand, ParkMyCloud is developing capabilities to add orphaned volume and snapshot management to our cost control platform.
Interested? Let us know here and we’ll notify you when this capability is released.
There’s a lot of talk about multi-cloud architecture – and apparently, a lot of disagreement about whether there is actually any logical use case to use multiple public clouds.
How many use multi-cloud already?
First question: are companies actually using a multi-cloud architecture?
According to a recent survey by IDG: yes. More than half (55%) of respondents use multiple public clouds: 34% use two, 10% use three, and 11% use more than three. IDG did not define “multi-cloud” for respondents. Given the limited list of major public clouds, the “more than three” set might be counting smaller providers – or respondents could be counting combinations such as AWS EC2 plus Google G Suite or Microsoft 365.
There certainly are some using multiple major providers – as one example, ParkMyCloud has at least one customer using compute infrastructure in AWS, Azure, Google Cloud, and Alibaba Cloud concurrently. In our observation, this is frequently manifested as separate applications architected on separate cloud providers by separate teams within the greater organization.
Why do organizations (say they) prefer multi-cloud?
With more than half of IDG’s respondents reporting a multi-cloud architecture, we have to wonder: why? Or at least – since we humans are poor judges of our own behavior – why do they say they use multiple clouds? Survey respondents indicated they adopted a multi-cloud approach to get best-of-breed platform and service options, while other goals included cost savings, risk mitigation, and flexibility.
Are these good reasons to use multiple clouds? Maybe. The idea of mixing service options from different clouds within a single application is more a dream than reality. Even with Kubernetes. (Stay tuned for a rant post on this soon).
Cloud economist Corey Quinn discussed this on a recent livestream with ParkMyCloud customer Rob Weaver. He asked Rob why his team at Workfront hadn’t yet completed a full Kubernetes architecture.
Rob said, “we had everything in a datacenter, and we decided, we’re going to AWS. We’re going there as fast as we can because it’s going to make us more flexible. Once we’re there, we’ll figure out how to make it save us money. We did basically lift and shift. …. Then, all of the sudden, we had an enormous deal come up, and we had to go into another cloud. Had we taken the approach of writing our own Lambdas to park this stuff, now GCP comes along. We would have to have written a completely different language, a completely different architecture to do the same thing. The idea of software-as-a-service and making things modular where I don’t really care what the implementation is has a lot of value.”
Corey chimed in, “I tend to give a lot of talks, podcasts, blog posts, screaming at people in the street, etc. about the idea that multi-cloud as a best practice is nuts and you shouldn’t be doing it. Whenever I do that, I always make it a point to caveat that, ‘unless you have a business reason to do it.’ You just gave the perfect example of a business reason that makes sense – you have a customer who requires it for a variety of reasons. When you have a strategic reason to go multi-cloud, you go multi-cloud. It makes sense. But designing that from day one doesn’t always make a lot of sense.”
So, Corey would say: Rob’s situation is the one use case where a multi-cloud architecture actually makes sense. Do you agree?
We occasionally get requests from customers asking whether ParkMyCloud can manage Microsoft Azure classic VMs as well as ARM VMs. Short answer: no. Since Azure has already announced the deprecation of classic resources – albeit not until March 2023 – you’ll find similar answers from other third-party services. Microsoft advises using only Resource Manager VMs. In fact, unless you already had classic VMs as of February 2020, you cannot create new ones.
As of February, though, 10% of IaaS VMs still used the classic deployment model – so there are a lot of users with workloads that need to be migrated in order to use third-party tools and new services, and to avoid the 2023 deprecation.
Azure Classic vs. ARM VM Comparison
Azure Classic and Azure Resource Manager (ARM) are two different deployment models for Azure VMs. In the classic model, resources exist independently, without groups for applications. In the classic deployment model, resource states, policies, and tags are all managed individually. If you need to delete resources, you do so individually. This quickly becomes a management challenge, with individual VMs liable to be left running, or untagged, or with the wrong access permissions.
Azure Resource Manager, on the other hand, provides a deployment model that allows you to manage resources in groups, which are typically divided by application with sub-groups for production and non-production, although you can use whatever groupings make sense for your workloads. Groups can consist of VMs, storage, virtual networks, web apps, databases, and/or database servers. This allows you to maintain consistent role-based access controls, tagging, cost management policies, and to create dependencies between resources so they’re deployed in the correct order. Read more: how to use Azure Resource Groups for better VM management.
How to Migrate to Azure Resource Manager VMs
For existing classic VMs that you wish to migrate to ARM, Azure recommends planning ahead and testing in a lab first. There are four migration scenarios, depending on the resources involved:
Migration of VMs not in a virtual network – VMs must be in a virtual network on ARM, so you can choose a new or existing virtual network. These VMs will need to be restarted as part of the migration.
Migration of VMs in a virtual network – these VMs do not need to be restarted and applications will not incur downtime, as only the metadata is migrating – the underlying VMs run on the same hardware, in the same network, and with the same storage.
Migration of storage accounts – you can deploy Resource Manager VMs in a classic storage account, so that compute and network resources can be migrated independently of storage. Then, migrate over storage accounts.
Migration of unattached resources – the following may be migrated independently: storage accounts with no associated disks or VMs, and network security groups, route tables, and reserved IPs that are not attached to VMs or networks.
There are a few methods you can choose to migrate:
Use the Azure classic CLI – note that you must use the classic CLI to migrate classic resources.