AWS Reserved Instances vs. Scheduling On-Demand Instances
We are frequently asked about the impact of instance scheduling on AWS Reserved Instances for EC2 and RDS. Scheduling On Demand instances as well as Reserved Instances (RIs) are both useful techniques for cost optimization, but they are polar opposites in terms of goals. RIs are all about getting a better price for RDS or EC2 instances that are running all the time. Scheduling is all about reducing costs by turning off instances when they are not in use.
How should a customer choose between buying an AWS Reserved Instance and applying scheduling to the instance? This is an important question, as usually the savings from scheduling can exceed the savings available from a Reserved Instance. Before we dive into that though, let’s get familiar with some critical RI nuances that can help you make the decision…and make the most of your RIs.
Note: versions of this article were originally published in 2015 and 2017. It has been completely re-analyzed, rewritten, and updated!
Where are my Reserved Instances?
First of all, it’s worth clarifying that there is no functional difference between Amazon Reserved Instances and On Demand. It’s all in the billing. <rant> I wish I had a dollar for each time someone asked me (while looking at the AWS Console) “Which ones of these are the Reserved Instances?” </rant>. You cannot be sure which instances are using a reservation until you get your bill. You can make a good guess, based on what account you are looking at and what reservations you have purchased, but if you are using instance size flexibility with region-based RIs, it would be a pure guess. An instance reservation is not like a hotel reservation, where you end up with a specific room for the duration of your stay. That would be a Dedicated Host Reservation – a whole other beast with its own set of limitations.
3600 Seconds per Hour
Yep – there are 3600 seconds in an hour. I did the math. You may ask: why does that matter? It matters because of two things: per-second billing for EC2, and the fact that AWS RIs are billed in one hour chunks. More precisely, RIs are billed in one “clock-hour” chunks, where a clock-hour starts on the hour and up to the final second of the hour – like 04:00:00 to 04:59:59.
If you are running a number of instances that use per-second billing, and they go up or down during a (clock) hour, then multiple instances may be able to leverage the Reserved Instance pricing.
As paraphrased shamelessly from here, if you purchase one m5.large Reserved Instance and run four matching m5.large instances for 15 minutes each (900 seconds each) in the same hour, the total running time is one hour, which consumes the entire Reserved Instance allocation. This usage pattern is illustrated below.
That said, if multiple matching instances are running at the same time, the Reserved Instance billing benefit is applied to all of the instances, up to a maximum of 3600 seconds in a clock-hour. On-demand rates are used for any usage after that. This is illustrated below.
By the way, remember that per-second billing only applies to instances running an open-license OS like Amazon Linux or Ubuntu.
If your instance is running any flavor of Windows, Red Hat Linux, or any pay-by-the-hour image from the AWS Marketplace, that instance is billed by the hour, and any matching reservation will be consumed by the hour, even if the instance is not. Any overlapping instances will each need their own RI to get RI pricing.
Elastic Reserved Instances
Wait…what? Isn’t that kind-of an oxymoron? Elastic means it should be able to change with my usage…and aren’t reservations are a commitment to usage? Well, maybe…but some RIs are definitely more elastic (or at least more flexible) than others. Regional Reserved Instances can leverage “instance size flexibility.” This is the ability of a reservation to apply up or down to other instance types in the same family. Regional RIs are applied in priority order from the smallest instance in the region to the largest instance in the region. This makes these AWS reservations somewhat elastic in that if you buy reservations for smaller instance types than you normally use, then those reservations are more likely to be consumed, even if you rightsize an instance or replace one server with another of a different size.
The math of how an RI of one instance type can be applied to another instance type is governed by a “normalization factor”, described here. Putting the table in that link another way:
Some examples of how we can use this table:
To fully cover one t3.medium with RIs, you would buy eight t3.nano RIs. If you stop using the t3.medium instance, the same eight t3.nano RIs could also be used to cover:
Two t3.small instances
One t3.small and two t3.micro
Half of a t3.large
To fully cover an m5.12xlarge would require 24 m5.large RIs. Those 24 m5.large RIs could also be used for:
One m5.8xlarge and one m5.4xlarge
One m5.8xlarge and two m5.2xlarge
One m5.8xlarge, one m5.2xlarge, one m5.xlarge, and two m5.large
Three quarters of an m5.16xlarge
The lesson here is that it is easiest to manage reserved instances if you buy the smallest instance size available within an instance family. In fact, as shown below this is the exact type of recommendation you are likely to receive from the AWS Cost Explorer.
So why do we mention AWS flexible reserved instances with relation to schedules? Recall that the allocation of RIs to instances is done by AWS each second within a clock-hour. Each reservation is consumed by instances in its family until all 3600 seconds have been allocated. You do not necessarily have to keep an instance up for the whole hour to use the reservation. You can still consume the whole RI, even if you are starting and stopping instances by schedule or within an auto-scaling group over the course of an hour.
Something so cool cannot be without limitations and gotchas. Here are a few we’ve noticed:
Instance size flexibility is ONLY available for certain instance platforms.
For AWS EC2 reserved instances, only open-license linux instances support flexible RIs. It is NOT APPLICABLE to any flavor of Windows or other licensed software and OS.
For AWS RDS reserved instances, instance size flexibility is available for MySQL, MariaDB, PostgreSQL, and Amazon Aurora database engines as well as the “Bring your own license” (BYOL) edition of the Oracle. [That said: it is not clear as of this writing if the recent RDS per-second billing change is also applied to RDS instance size flexibility. The billing pages still state that RDS RIs are allocated to instances on a per-hour basis.]
Flexibility is only available within the same instance family
Flexibility is only available for Regional RIs
How does AWS Reserved Instance Pricing Work?
As usual, there are a number of different decisions that need to be made before this question can be answered. AWS RIs are available with two different levels of flexibility and several different terms of purchase.
In terms of flexibility, RIs can either be purchased as “standard” or “convertible.”
AWS Convertible Reserved Instances – allow changing the assigned availability zone, instance size/families, operating system, tenancy, and networking type. There is no fee for changing the reservation attributes, but you can only convert the RI into a new set of attributes with equal or greater cost than the original. If the cost is greater, you will have to pay the difference.
Regional vs. Availability Zone assignment scope. Regional scope offers a lot more flexibility for where your instances can reside while still leveraging RI pricing. Availability Zone scope has the bonus of a guaranteed capacity reservation (though, like RI pricing, capacity reservation cannot be assigned to a specific instance).
For open-license Linux, instance size flexibility, as described earlier.
There are a few options for terms of purchase that affect AWS RI pricing. RIs can be purchased for 1 or 3-year terms, and with no upfront, partial upfront, or all upfront payment available for each.
To get a better idea of the difference in the amount you’ll end up paying with each purchasing option, let’s take a look at some pricing scenarios for the general purpose m5.large instance in us-east-1 (US Virginia). When you look at these on the AWS RI price list, they are usually grouped by year terms and standard vs. convertible. Since you can already see it that way on AWS, and because time is the greatest unknown, I am going to give a bit of a different view, keeping the years together on the same comparison chart.
One-Year Term RIs
To benefit from the discount provided by a reservation while keeping contract terms to a minimum, look at the 1-year RIs:
You can see from this that within the Convertible vs. Standard tiers, there is not a lot of difference between the upfront cost payment options. Your upfront cash decision here can depend on your accounting process, perhaps accounting for the difference between CapEx and OpEx (Capital Expenditures and Operating Expenditures), or based on your current and prospective cash flow. The pricing between the Convertible vs. Standard options are so similar as to make you seriously consider if the benefit of flexibility in the AWS Reserved Instances convertible option outweighs the minor cost savings.
Three-Year Term RIs
If are confident of your three-year usage plans, look at the 3-year RIs, shown here as the cost per year:
There is a bit more visible progression between the options here, and the cost savings difference between the 1-year graph and the 3-year graph are clearly seen. Still, the 1-year vs. 3-year decision may again come down to accounting practices.
Note that these graphs are just for one instance type in one location in about the middle of the overall performance pack. They do not express a true profile of all instance types/families in all locations, and you should do your own comparison before buying any RIs. That said….
AWS Reserved Instances vs. On Demand
FINALLY getting into the core question. Is it better to shut an instance down or buy a reserved instance? Let’s do the math and compare AWS On Demand vs. Reserved, with on/off scheduling applied to the On Demand option. Given all the different pricing options it can be a bit difficult to decide how to compare these. Let’s go with the no-upfront RI options and see where the RIs break even with scheduling.
Since the hours might be judge at this scale, here are the break-even hours in a table (hours rounded-up):
So what do these hours mean?
By way of comparison, here are a few common schedules and their hours in ParkMyCloud (with weeks converted to months on the basis of 52 weeks/12 months = 4.33 weeks/month):
Running 8 AM – 5 PM, Weekdays = 195 hours/month (nominally a 73% savings)
Of the above, 7 AM – 7 PM Weekdays is the second most popular schedule in our entire system, and at 260 hours, easily is more cost-effective than the least expensive RI.
In that final schedule above, you need to run 16 hours per day, Monday-Friday before a 3-year RI starts to look like a better deal. If you have other matching instances on another schedule in that same region/AZ, you could then buy the RI anyway, and save even more money.
Scheduled Reserved Instances
“But wait!” you say… “There are these things called Scheduled Reserved Instances – would they not offer me the best of both worlds?” Maybe, but note that to support Scheduled RIs, AWS has essentially set aside hardware for use as Scheduled RIs – this is a finite set of dedicated resources. On paper, a Scheduled RI can definitely save you a bit more money, but let’s look at the limitations.
Instances intended to consume a Scheduled Reservation must be launched during their scheduled time periods using a launch configuration that matches the reservation. Let’s break that down further:
They are launched with “a launch configuration” and terminated by EC2 three minutes before the end of the schedule window. You cannot stop/start/suspend them, and so they cannot maintain state internally unless you do some EBS volume tweaks after launch.
You cannot launch a scheduled RI outside of its scheduled window. If you try, that instance will not use the Reservation, as you would not expect it to magically jump over onto the dedicated hardware.
Only three regions and only a few older generation instances families are supported.
Limited quantities are available (as of this writing, a search for an m4.large in us-east-1 with a Monday-Friday 7 AM – 7 PM scheduled listed only two reservations being available).
There is a minimum reservation size of 1,200 hours/year, 100 hours/month, 24 hours/week, or 4 hours/day. (So you are not going to be able to use a Scheduled RI to run your database backup server for just a couple hours on Sunday mornings [I just tried…dang it.])
Scheduled RI purchases cannot be cancelled, modified or sold on the RI Marketplace. No buyer’s remorse allowed here…
And finally… AWS describes the cost savings of Scheduled RIs over on-demand as being in the 5-10% range. Again – you have to be really sure this workload is going to be needed on this schedule, and the workload is fully utilized. If you find this server is ever idle or under-utilized, you may have lost an opportunity to save 50% or more by simply scheduling it with ParkMyCloud and then rightsizing it.
When launched in 2016, Scheduled RIs were only available in “US East (N. Virginia), US West (Oregon), and Europe (Ireland) regions, with support for the C3, C4, M4, and R3 instance types.” Given that this has not changed, I am guessing the feature has not taken off as AWS had hoped. That said, if you have something that needs to run for 4 hours per day, daily, and can use an older instance family in a supported region…go for it! This may also be a good place to leverage AWS Spot instance types.
Should You Buy Reserved Instances?
Deciding between different RI options starts with knowledge of how stable your usage is going to be in the coming years. Most organizations have a good understanding of at least a baseline level of usage, and some actually construct portfolios of reserved instance purchases, much like how you might balance a stock portfolio. For example, 3-year Standard Reserved Instances are for those applications that are stable and unlikely to change. Balance those with some 3-year Convertible Reserved Instances to help save money in the longer term but allow for some flexibility in use. Use 1-year RIs to account for shorter-term usage volatility, with again maybe a balance between Standard and Convertible where needed for those loads that might shift over the short term.
Just like with stocks, such portfolios require active management, awareness of what is being used and what is coming in both the short and long term. Companies with these types of RI management typically have dedicated cloud cost management staff, though RI management itself does not need to be a full-time job. One of our larger customers tells us they do a quarterly review of their RI portfolio, starting with using ParkMyCloud to schedule any instances they can, and then rebalancing their RIs and investing in more as needed. This is a best practice for any size company.
For production workloads that run 24×7 long-term, you REALLY should be buying RIs. For production workloads with an unknown future, or for non-production workloads (dev, test, QA, training, etc.), the answer is “probably not”. Be careful though, as often RIs are seen as a quick fix to save 30-50% on the cloud bill, but then other ways are found to save even more, and you end up with underutilized RIs. Before you buy a Standard RI, make sure you have evaluated your resources for underutilization and overprovisioning. Also consider the following cost-saving options:
Choose smaller instance sizes – each size down can save 50%. Use ParkMyCloud Rightsizing to first make sure your resources are appropriately sized for their load before committing to an RI purchase.
For open-license Linux instances, buy RIs for smaller instance types in the same instance family in order to leverage instance size flexibility (allows the RIs to more easily be allocated against other resources, even if they are running for short periods of time)
Use Spot Instances and Autoscaling when appropriate for short-term/volatile workloads.
Schedule on/off times for on-demand instances. Use ParkMyCloud SmartParking to automatically create schedules for when the resources are not being used, while still giving your staff the flexibility to easily restart them if needed outside the normal hours. Even a generous 7 AM – 7 PM workday schedule will immediately save 65%!
The worst thing one can do is…nothing. Set aside some time each quarter to review EC2 and RDS instance utilization. Rightsize and schedule what you can – try a free trial of ParkMyCloud to see how we can help you with that. After that, anything left running 24×7 is a candidate for an RI. If you are risk-averse or time-crunched, stick with 1-year convertible RIs to save at least 36% on at least some of your resources. Then take all that money you saved and put it into other things that will bring greater value to your business.
Earlier this week, AWS announced the launch of AWS resource optimization recommendations within their cost management portal. AWS claims that this will “identify opportunities for cost efficiency and act on them by terminating idle instances and rightsizing under-used instances.” Here’s what that actually means, and what functionality AWS still does not provide that users need in order to automate cost control.
AWS Recommendations Overview
AWS Recommendations are an enhancement to the existing cost optimization functionality covered by AWS Cost Explorer and AWS Trusted Advisor. Cost Explorer allows users to examine usage patterns over time. Trusted Advisor alerts users about resources with low utilization. These new recommendations actually suggest instances that may be a better fit.
AWS Resource Optimization provides two types of recommendations for EC2 instances:
Terminate idle instances
Rightsize underutilized instances
These recommendations are generated based on 14 days of usage data. It considers “idle” instances to be those with lower than 1% peak CPU utilization, and “underutilized” instances to be those with maximum CPU utilization between 1% and 40%.
While any native functionality to control costs is certainly an improvement, users often express that they wish AWS would just have less complex billing in the first place.
AWS Resource Optimization Tool vs. ParkMyCloud
ParkMyCloud offers cloud cost optimization through RightSizing for AWS, as well as Azure and Google Cloud, in addition to our automated scheduling to shut down resources when they are idle. Note that AWS’s new functionality does not include on/off schedule recommendations.
Here’s how the new AWS resource optimization tool stacks up against ParkMyCloud.
Types of Recommendations Generated
The AWS Resource Optimization tool will provide up to three recommendations for size changes within the same instance family, with the most conservative recommendation listed as the primary recommendation. Putting it another way, the top recommendation will be one size down from the current instance, the second recommendation will be two sizes down, etc. ParkMyCloud recommends the optimal instance type and size for the workload, regardless of the existing instance’s configuration. This includes instance modernization recommendations, which AWS does not offer.
The AWS tool generates recommendations for EC2 instances only, while ParkMyCloud recommends scheduling and RightSizing recommendations for EC2 and RDS. AWS also does not support GPU-based instances in its recommendations, while ParkMyCloud does.
AWS customers must explicitly enable generation of recommendations in the AWS Cost Management tools. In ParkMyCloud, recommendations are generated automatically (with some access limitations based on subscription tier).
ParkMyCloud allows you to manage resources across a multi-cloud environment including AWS, Azure, Google Cloud, and Alibaba Cloud. AWS’s tool, of course, only allows you to manage AWS resources.
When you start to dig in, you’ll notice several limitations of the recommendations provided by AWS. The recommendations are based on utilization data from the last 14 days, a range that is not configurable. ParkMyCloud’s recommendations, on the other hand, can be based on a configurable range of 1-24 weeks of data, configurable by the customer by team, cloud provider, and resource type.
Another important aspect of “optimization” that AWS does not allow the user to configure are the utilization thresholds. AWS assumes that any instance at less than 1% CPU utilization is idle, and assumes any instance between 1-40% CPU utilization is undersized. While these are reasonable rules of thumb, users need the ability to customize such thresholds to best suit their own environment and use cases, and AWS does not allow you to customize these parameters. AWS also assumes an “all or nothing” approach – they recommend that any instance detected as idle simply be terminated. ParkMyCloud does not assume that low utilization means the instance should be terminated, but suggests sizing and/or scheduling solutions with specificity to the utilization patterns. ParkMyCloud allows users to select between Conservative, Balanced, or Aggressive schedule recommendations with customizable thresholds.
AWS also only evaluates “maximum CPU utilization” to determine idle resources. However, for resource schedule recommendations, ParkMyCloud uses both Peak and Average CPU plus network utilization for all instances, and memory utilization for instances with the CloudWatch agent installed. For sizing recommendations, ParkMyCloud uses maximum Average CPU plus memory utilization data if available.
Perhaps the most dangerous aspect of the AWS Recommendation is they will recommend an instance size change based on CPU alone, even if they do not have memory metrics. Without cross-family recommendation, this means each size down typically cuts the memory in half. ParkMyCloud Rightsizing Recommendations do not assume this is OK. In the absence of memory metrics, we make downsizing recommendations that keep memory constant. For a concrete example of this, here is an AWS recommendation to downsize from m5.xlarge to m5.large, cutting both CPU and memory, and resulting in a net savings of $60 per month.
In contrast, here is the ParkMyCloud Rightsizing Recommendation for the same instance:
You can see that while the AWS recommendation can save $60 per month by downsizing from m5.xlarge to m5.large, the top ParkMyCloud recommendation saves a very similar $57.67 by allowing the transition from m5.xlarge to r5a.large, keeping memory constant. While the savings are off by $2.33, this a far less risky transition and probably worth the difference. In both cases, of course, memory data from the CloudWatch Agent would likely result in better recommendations.
As shown in the AWS recommendation above, AWS provides the “RI hours” for the preceding 14 days, giving better visibility into the impact of resizing on your reserved instance usage, and uses this data for the cost and savings calculations. ParkMyCloud does not yet provide correlation of the size to RI usage, though that is planned for a future release. That said, the AWS documentation also states “Rightsizing recommendations don’t capture second-order effects of rightsizing, such as the resulting RI hour’s availability and how they will apply to other instances. Potential savings based on reallocation of the RI hours aren’t included in the calculation.” So the RI visibility on the AWS side has minimal impact on the quality of their recommendations.
If the user is viewing the AWS Recommendation from within the same account as the target EC2 instance, a “Go to the Amazon EC2 Console” button appears on the recommendation details, but it leads to the EC2 console for whatever your last-used region was, and without an automatic filter for the specific instance ID. This means you need to do your own navigation to the right region (perhaps also requiring a new console login if the recommendation is for a different account in the Organization), and then find the instance to see the details. ParkMyCloud provides better ease-of-use in that you can jump directly from the recommendation into the instance details, regardless of your AWS Organization account structure. ParkMyCloud: 1 click. AWS: At least five, plus copy/paste of the instance ID and plus possibly a login.
ParkMyCloud also shows utilization data for the recommendation below the recommendation text, giving excellent context. AWS again requires navigation the right account, EC2 and then region, or to CloudWatch and the right metrics using the instance ID.
AWS Resource Optimization also ignores instances that have not been run for the past three days. ParkMyCloud takes this lack of utilization into consideration and does not discard these instances from recommendations.
AWS regenerates recommendations every 24 hours. ParkMyCloud regenerates recommendations based on the calculation window set by the customer.
Automation & Ease of Use
While AWS’s new recommendations are generated automatically, they all must be applied manually. ParkMyCloud allows users to accept and apply scheduling recommendations automatically, via a Policy Engine based on resource tagging and other criteria. RightSizing changes can be “applied now”, or scheduled to occur in the future, such as during a designated maintenance window.
There is also the question of visibility and access across team members. In AWS, users will need access to billing, which most users will not have. In ParkMyCloud, Team Leads have access to view and execute recommendations for resources assigned to their respective teams. Additionally, recommendations can be easily exported, so business units or teams can share and review recommendations before they’re accepted if required by their management process.
AWS’s management console and user interface are often cited as confusing and difficult to use, a trend that has unfortunately carried forward to this feature. On the other hand, ParkMyCloud makes resource management straightforward with a user-friendly UI.
Want to see what ParkMyCloud will recommend for your environment? Try it out with a free trial, which gives you 14-day access to our entire feature set, and you can see what cost optimization recommendations we have for you.
While going through our recent Cloud Cost Optimization Competency review with AWS, one of the things they asked us to do was remove the ability for customers to sign up for our service using AWS IAM User credentials. They loved the fact that we already supported AWS IAM Role credentials, but their concern was that AWS IAM User credentials could conceivably be stolen and used from outside AWS by anyone. (I say inconceivable, but hey, it is AWS.) This was a bit of a bitter pill to swallow, as some customers find IAM Users easier to understand and manage than IAM Roles. The #1 challenge of any SaaS cloud management platform like ours is customer onboarding, where every step in the process is one more hurdle to overcome.
While we could debate how difficult it would be to steal a customer cloud credential from our system, the key (pun intended) thing here is why is an IAM Role preferred over an IAM User?
Before answering that question, I think it is important to understand that an IAM Role is not a “role” in perhaps the traditional sense of Active Directory or LDAP. An AWS IAM Role is not something that is assigned to a “User” as a set of permissions – it is a set of capabilities that can be assumed by some other entity. Like putting on a hat, you only need it at certain times, and it is not like it is part of who you are. As AWS defines the difference in their FAQ:
An IAM user has permanent long-term credentials and is used to directly interact with AWS services. An IAM role does not have any credentials and cannot make direct requests to AWS services. IAM roles are meant to be assumed by authorized entities, such as IAM users, applications, or an AWS service such as EC2.
(The first line of that explanation alone has its own issues, but we will come back to that…)
The short answer for SaaS is that a customer IAM Role credential can only be used by servers running from within the SaaS provider’s AWS Account…and IAM User credentials can be used by anyone from anywhere. By constraining the potential origin of AWS API calls, a HUGE amount of risk is removed, and the ability to isolate and mitigate any issues is improved.
What is SaaS?
Software as a Service (SaaS) means different things to different vendors. Some vendors claim to be “SaaS” for their pre-built virtual machine images that you can run in your cloud. Maybe an intrusion detection system or a piece of a cloud management system. In my (truly humble) opinion this is not a SaaS – this is just another flavor of “on prem” (on-premise), where you are running someone’s software in your environment. Call it “in-cloud” if you do not want to call it “on-prem”, but it is not really SaaS, and it does not have the challenges you will experience with a “true” SaaS product – coming in from the outside. A core component of SaaS is that it is centrally hosted – outside your cloud. For an internal service, you might relax permissions and access mechanisms somewhat, as you have total control over data ingress/egress. A service running IN your network…where you have total control over data ingress/egress…is not the same as external access – the epitome of SaaS. Anyway: </soapbox>. (Or maybe </rant> depending on the tone you picked up along the way…)
The kind of SaaS I am focussing on for this blog is SaaS for cloud management, which can include cloud diagramming tools, configuration management tools, storage management+backup tools, or cost optimization tools like ParkMyCloud.
AWS has enabled SaaS for secure cloud management more than any other cloud provider. A bold statement, but let’s break that down a bit. We at ParkMyCloud help our customers optimize their expenses at all of the major cloud providers and so obviously all the providers allow for access from “outside”. Whether it is an Azure subscription, a GCP project, or an Alibaba account, these CSP’s are chiefly focussed on customer internal cross-domain access. I.e., the ability of the “parent” account to see and manage the “child” accounts. Management within an organization. But AWS truly acknowledges and embraces SaaS.
You could attribute my bold statement to an aficionado/fanboi notion of AWS having a bigger ecosystem vision, or more specifically that they simply have a better notion of how the Real World works, and how that has evolved in The Cloud. The fact is that companies buy IT products from other companies…and in the cloud that enables this thing called Software as a Service, or SaaS. All of the cloud providers have enabled SaaS for cloud access, but AWS has enabled SaaS for more secure cloud access.
AWS IAM Cross-account Roles
So…where was I? Oh…right…Secure SaaS access.
OK, so AWS enables cross-account access. You can see this in the IAM Create Role screen in the AWS Console:
If your organization owns multiple AWS accounts (inside or outside of an AWS “organization”), cross-account access allows you to use a parent account to manage multiple child accounts. For SaaS, cross-account access allows a 3rd-party SaaS provider to see/manage/do stuff with/for your accounts.
Looking a little deeper into this screen, we see that cross-account access requires you to specify the target account for the access:
The cross-account role allows you to explicitly state which other AWS account can use this role. More specifically: which other AWS account can assume this role.
But there is an additional option here talking about requiring an “external ID”…what is that about?
Within multiple accounts in a single organization, this may allow you to differentiate between multiple roles between accounts….maybe granting certain permissions to your DevOps folks…other permissions to Accounting…and still other permissions to IT/network management.
If you are a security person, AWS has some very interesting discussions about the “confused deputy” problem mentioned on this screen. It discusses how a hostile 3rd party might guess the ARN used to leverage this IAM Role, and states that “AWS does not treat the external ID as a secret” – which is all totally true from the AWS side. But summing it up: cross-account IAM Roles’ external IDs do not protect you from insider attacks. For an outsider, the External ID is as secret as the SaaS provider makes it.
Looking at it from the external SaaS side, we get a bit of a different perspective. For SaaS, the External ID allows for multiple entry points…and/or a pre-shared secret. At ParkMyCloud (and probably most other SaaS providers) we only need one entry point, so we lean toward the pre-shared secret side of things. When we, and other security-conscious SaaS providers, ask for access, we request an account credential, explicitly giving our AWS account ID and an External ID that is unique for the customer. For example, in our UI, you will see our account ID and a customer-unique External ID:
Assume Role…and hacking SaaS
If we look back at the definition of the AWS IAM Role, we see that IAM roles are meant to be assumed by authorized entities. For an entity to assume a role, that party has to be an AWS entity that has the AWS sts:AssumeRole permission for the account in which it lives. Breaking that down a bit, the sts component of this permission tells us this comes from the AWS Secure Token Services, which can handle whole chains of delegation of permissions. For ParkMyCloud, we grant our servers in AWS an IAM Role that has the sts:AssumeRole permission for our account. In turn, this allows our servers to use the customer account ID and external ID to request permission to “Assume” our limited-access role to manage a customer’s virtual machines.
From the security perspective, this means if a hostile party wanted to leverage SaaS to get access to a SaaS customer cloud account via an IAM User, they would need to:
Learn an account ID for a target organization
Find a SaaS provider leveraged by that target organization
Hack the SaaS enough to learn the External ID component of the target customer account credentials
Completely compromise one of the SaaS servers within AWS, allowing for execution of commands/APIs to the customer account (also within AWS), using the account ID, External ID, and Assume Role privileges of that server to gain access to the customer account.
Have fun with the customer SaaS customer cloud, but ONLY from that SaaS server.
So….kind-of a short recipe of what is needed to hack a SaaS customer. (Yikes!) But this is where your access privileges come in. The access privileges granted via your IAM role determine the size of the “window” through which the SaaS provider (or the bad guys) can access your cloud account. A reputable SaaS provider (ahem) will keep this window as small as possible, commensurate with the Least Privilege needed to accomplish their mission.
Also – SaaS services are updated often enough that the service might have to be penetrated multiple times to maintain access to a customer environment.
So why are AWS IAM Users bad?
Going back to the beginning, our quote from AWS stated “An IAM user has permanent long-term credentials and is used to directly interact with AWS services”. There are a couple frightening things here.
“Permanent long-term credentials” means that unless you have done something pretty cool with your AWS environment, that IAM User credential does not expire. An IAM User credential consists of a Key ID and Secret Access Key (an AWS-generated pre-shared secret) that are good until you delete them.
“…directly interact with AWS services” means that they do not have to be used from within your AWS account. Or from any other AWS account. Or from your continent, planet, galaxy, dimension, etc. That Key ID and Secret can be used by anyone and anywhere.
From the security perspective, this means if a hostile party wanted to leverage SaaS to get access to a SaaS customer cloud account via an IAM Role, they would need to:
Learn an account ID for a target organization
Find a SaaS provider leveraged by that target organization
Hack the SaaS enough to get the IAM User credentials.
Have fun…from anywhere.
So this list may seem only a little bit shorter, but the barriers to compromise are somewhat lower, and the opportunity for long-term compromise is MUCH longer. Any new protections or updates for the SaaS servers has no impact on an existing compromise. The horse has bolted, so shutting the barn door will not help at all.
What if the SaaS provider is not in AWS? Or…what if *I* am not in AWS?
The other cloud providers provide some variation of an access identifier and a pre-shared secret. Unlike AWS, both Azure and Google Cloud credentials can be created with expiration dates, somewhat limiting the window of exposure. Google does a great job of describing their process for Service Accounts here. In the Azure console, service accounts are found under Azure AD>App registrations>All apps>App details>Settings>Keys, and passwords can be set to expire in 1 year, 2 years, or never. I strongly recommend you set reminders someplace for these expiration dates, as it can be tricky to debug an expired service account password for SaaS.
For all providers you can also limit your exposure by setting a very limited access role for your SaaS accounts, as we describe in our other blog here.
Azure does give SaaS providers the ability to create secure “multi-tenant” apps that can be shared across multiple customers. However, the API’s for SaaS cloud management typically flow in the other direction, reaching into the customer environment, rather than the other way around.
IAM Role – the Clear Winner
Fortunately, when AWS “strongly recommended” that we should discontinue support for AWS IAM User-based permissions, we already supported an upgrade path, allowing our customer to migrate from IAM User to IAM Role without losing any account configuration (phew!). We have found some scenarios where IAM Role cannot be used – like between the AWS partitions of AWS global, AWS China, and the AWS US GovCloud. For GovCloud, we support ParkMyCloud SaaS by running another “instance” of ParkMyCloud from within GovCloud, where cross-account IAM Role is supported.
With the additional security protections provided for cross-account access, AWS IAM Role access is the clear winner for SaaS access, both within AWS and across all the various cloud providers.
The principle of least privilege is important to understand and follow as you adopt SaaS technologies. The market for SaaS-based tools is growing rapidly, and can typically be activated much more quickly and cheaply than creating a special-purpose virtual machine within your cloud environment. In this blog, I am focusing specifically on the SaaS cloud management tool area, which can include services like cloud diagramming tools, configuration management tools, storage management and backup tools, or cost optimization tools like ParkMyCloud.
Why the Principle of Least Privilege is Important
Before you start using such tools and services, you should carefully consider how much access you are granting into your cloud. The principle of least privilege is a fundamental tenet of any identity and access control policy, and basically means a service or user should have no more permissions than absolutely required in order to do a job.
Cloud account privileges and permissions are typically granted via roles and permissions. All of the cloud providers provide numerous predefined roles, which consist of pre-packaged sets of permissions. Before granting any requested predefined role to a 3rd-party, you should really investigate the permissions or security policy embedded in that role. In many (most?) cases, you are likely to find that the predefined roles give a lot more information or capabilities away than you are really likely to want.
SaaS Onboarding – Where Least Privilege Can Get Lost
For on-boarding of new SaaS customers, the initial permissions setup is often the most complicated step, and some SaaS cloud management platforms try to simplify the process by asking for one of these predefined roles. For example, the Amazon ReadOnlyAccess role or the Azure Reader role or the GCP roles/viewer role. While this certainly makes onboarding of SaaS easier, it also exposes you to a massive data leakage problem. For example, with the Amazon ReadOnlyAccess role a cloud diagramming tool can certainly get a good enough view of your cloud to create a map…but you are also granting read access for all of your IAM Users, CloudTrail events and history, any S3 objects you have not locked-down with a distinct bucket policy, and….lots of other stuff you probably do not even know you have. It is like kinda like saying – “Here, please come on in and look at all of our confidential file cabinets – and it is OK for you to make copies of anything interesting, just please do not change any of our secrets to something else…” No problem, right?
Obviously, least privilege becomes especially critical when giving permissions to a SaaS provider, given the risk of trusting your cloud environment to some unknown party.
Custom Policies for SaaS
Because of the broad nature of many of their predefined roles, all of the major cloud providers give you the ability to assign specific permissions to both internal and external users through Policies. For example, the following policy snippets show the minimum permissions ParkMyCloud requests to list, start, and stop virtual machines on AWS, Google, and Azure.
Creating and assigning these permissions makes SaaS onboarding a bit more complicated, but it is worth the effort in terms of reducing your exposure.
Other Policy Restrictions
What if you want to give a SaaS provider permissions, but lock it down to only certain resources or certain regions? AWS and Azure allow you to specify in the policy which resources the policy can be applied to. Google Cloud….not so much. AWS takes this the farthest, allowing for very robust policies down to specific services, and the addition of tag-based caveats for the policy permissions, for example:
This policy locks down the Start and Stop permissions to only those instances that have the tag name/value parkmycloud: yes,and are located in the us-east-1region. Similar Conditions can be used to lock this down by region, instance type, and many other situations. (This recent announcement shows another way to handle the region restriction.)
Azure has somewhat similar features, though with a slightly different JSON layout, as described here. It does not appear you can use resource tags to for Azure, nor does Azure provide easy ways to limit the geographic scope of permissions. You can get around the location and grouping of resources by using Azure Management Groups, but that is not quite as flexible as an arbitrary tag-based system, and is actually more intended to aggregate resources across subscriptions, rather than be more specific within a subscription. That said, the Azure permissions defined here are a bit more granular than AWS. This does allow for a bit more specificity in permissions if it is needed, but can no doubt grow tedious to list and manage.
Google Cloud provides a long list of predefined roles here, with an excellent listing the contained permissions. There is also an interesting page describing the taxonomy of the permissions here, but Google Cloud appears to make it a bit difficult to enumerate and understand the permissions individually, outside of the predefined roles. Google does not provide any tag or resource-based restrictions, apart from assignment at the Project level. More on user management and roles by cloud provider in this blog.
You may note that the ec2:Describe permission in our last example does not have the tag-based restriction. This is because the tag-based restriction can only be used for certain permissions, as shown in the AWS documentation. Note also that some APIs can do several different operations, some of which you may be OK with sharing, and others not. For example, the AWS ModifyInstance permission allows the API user to change the instance type. But…this one API (and associated permission) also allows the API user to modify security group assignments, shutdown behaviors, and other features – things you may not want to share with an untrusted 3rd party.
Key takeaway here? Look out for permissions that may have unexpected consequences.
Beware of SaaS cloud management providers who are asking for simple predefined roles from your cloud provider. They are either giving a LOT more functionality than you are likely to want from a single provider, or they are asking for a lot more permissions than they need. Ask for a “limited access policy” that gives the SaaS provider ONLY what they need, and look for a document that defines these permissions and how they tie back to what the SaaS provider is doing for you.
These limited access policies serve to limit your exposure to accidents or compromises at the SaaS provider.
Serving sizing in the cloud can be tricky. Unless you are about to do some massive high-performance computing project, super-sizing your cloud virtual machines/instances is probably not what you are thinking about when you log in to your favorite cloud service provider. But from looking at customer data within our system, it certainly does look like a lot of folks are walking up to their neighborhood cloud provider and saying exactly that: Super Size Me!
Like at a fast-food place, buying in the super size means paying extra costs…and when you are looking for ways to save money on cloud costs, whether for production or non-production resources, the first place to look is at idle and underutilized resources.
Within the ParkMyCloud SaaS platform, we have collected bazillions (scientific term) of samples of performance data for tens of thousands of virtual machines, across hundreds of customers, and the average of all “Average CPU” readings is an amazing (even to us) 4.9%. When you consider that many of our customer are already addressing underutilization by stopping or “parking” their instances when they are not being used, one can easily conclude that the server sizing is out of control and instances are tremendously overbuilt. In other words, they are much more powerful than they need to be…and thus much more expensive than they need to be. As cool as “super sizing” sounds, the real solution is in rightsizing, and ensuring the instance size and type are better tailored to the actual load.
Before we start talking about what is involved in rightsizing, let’s look at a few more statistics, just because the numbers are pretty cool. Looking at utilization data from about 88.9 million instance-hours on AWS – that’s 10,148 years – we find the following:
So, what is this telling us about server sizing? The percentiles alone tell us that more than 95% of our samples are operating at less than 50% Average CPU – which means if we cut the number of CPUs in half for most of our instances, we would probably still be able to carry our workload. The 95th percentile for Peak CPU is 58%, so if we cut all of those CPUs in half we would either have to be OK with a small degradation in performance, or maybe we select an instance to avoid exceeding 99% peak CPU (which happens around the 93rd percentile – still a pretty massive number).
Looking down at the 75th and 50th percentiles we see instances that could possibly benefit from multiple steps down! As shown in the next section, one step down can save you 50% of the cost for an instance. Two steps down can save you 75%!
Before making an actual server sizing change, this data would need to be further analyzed on an instance by instance basis – it may be that many of these instances have bursty behavior, where their CPUs are more highly utilized for short periods of time, and idle all the rest of the time. Such an instance would be better off being parked or stopped for most of the time, and only started up when needed. Or…depending on the duration and magnitude of the burst, might be better off moving to the AWS T instance family, which accumulates credits for bursts of CPU, and is less expensive than the M family, built for a more continuous performance duty cycle. Also – as discussed below – complete rightsizing would entail looking at some other utilization stats as well, like memory, network, etc.
On every cloud provider there is a clear progression of server sizing and prices within any given instance family. The next size up from where you are is usually twice the CPUs and twice the memory, and as might be expected, twice the price.
Here is a small sample of AWS prices in us-east-1 (N. Virginia) to show you what I mean:
Double the memory and/or double the CPU…and double the price.
It is important to note that there is more to instance utilization than just the CPU stats. There are a number of applications with low-CPU but high network, memory, disk utilization, or database IOPs, and so a complete set of stats are needed before making a rightsizing decision.
This can be where rightsizing across instance families makes sense.
On AWS, some of the most commonly used instance types are the T and M general purpose families. Many production applications start out on the M family, as it has a good balance of CPU and memory. Let’s look at the m5.4xlarge as a specific example, shown in the middle row below.
If you find that such an instance was showing good utilization of its CPU, maybe with an Average CPU of 75% and Peak CPU of 95%, but the memory was extremely underutilized, maybe only consuming 20%, we may want to move to more of a compute-optimized instance family. From the table below, we can see we could move over to a c5.4xlarge, keeping the same number of CPUs, but cutting the RAM in half, saving about 11% of our costs.
On the other hand, if you find the CPU was significantly underutilized, for example showing an Average CPU of 30% and Peak of 45%, but memory was 85% utilized, we may be better off on a memory-optimized instance family. From the table below, we can move to an r5.2xlarge instance cutting the vCPUs in half, and keeping the same amount of RAM, and saving about 34% of the costs.
Within AWS there are additional considerations on the network side. As shown here, available network performance follows the instance size and type. You may find yourself in a situation where memory and CPU are non-issues, but high network bandwidth is critical, and deliberately super-size an instance. Even in this case, though, you should think about whether there is a way to split your workload into multiple smaller instances (and thus multiple network streams) that are less expensive than a beastly machine selected solely on the basis of network performance.
You may also need to consider availability when determining your server sizing. For example, if you need to run in a high-availability mode using an autoscaling group you may be running two instances, either one of which can handle your full load, but both are only 50% active at any given time. As long as they are only 50% active that is fine – but you may want to consider if maybe two instances at half the size would be OK, and then address a surge in load by scaling-up the autoscaling group.
For full cost optimization for your virtual machines, you need to consider appropriate resource scheduling, server sizing, and sustained usage.
Rightsize instances wherever possible. You can easily save 50% just by going down one size tier – and this applies to production resources as well as development and test systems!
Modernize your instance types. This is similar to rightsizing, in that you are changing to the same instance type in a newer generation of the same family, where cloud provider efficiency improvements mean lower costs. For example, moving an application from an m3.xlarge to an m5.xlarge can save 28%!
Park/stop instances when they are not in use. You can save 65% of the cost of a development or test virtual machine by just having it on 12 hours per day on weekdays!
For systems that must be up continually, (and once you have settled on the correct size instance) consider purchasing reserved instances, which can save 54-75% off the regular cost. If you would like a review of your resource usage to see where you can best utilized reserved instances, please let us know.
Last week, ParkMyCloud released the ability to rightsize and modernize instances. This release helps you identify the virtual machine and database instances that are not fully utilized or on an older family, making smart recommendations for better server sizing and/or family selection for the level of utilization, and then letting you execute the rightsize action. We will also be adding a feature for scheduled rightsizing, allowing you to maintain instance continuity, but reducing its size during periods of lower utilization.