Today we have news about cloud cost governance that both Finance and DevOps folks will appreciate: ParkMyCloud and CloudHealth have taken our partnership a step further with a first-of-its-kind technical integration. Our products now work together to give you a seamless cloud management experience, with a single place to go for multi-cloud cost management, reporting, and governance. Our goal is to save you time and money, and to improve financial accountability and management processes.
Customers in software, biotechnology, and education have tried it out, and the feedback has been great: they are saving an average of $25,000 per month on their cloud bills. They say it’s rare to find integrations between the major platforms they use throughout the day, and this setup is unique.
Melanie Metcalfe, Director of Project Support at Foster Moore, said, “What we need to manage and optimize our cloud environments is cost control, user governance, and detailed reporting. It makes our cloud operations simpler and easier when solutions from different vendors are integrated out of the box, and we’re glad to see CloudHealth and ParkMyCloud making this a reality.”
Here’s what a typical use case might look like if you’re a user of both products:
- You log in to your CloudHealth account and take a quick look at your AWS dashboard.
- You navigate to Pulse -> HealthCheck to find all possible optimizations in your environment.
- On the list, you see ParkMyCloud, indicating that you have savings potential.
- You click that to check out your list of EC2 instances, and find a few with a ParkMyCloud icon to show they’re recommended to park.
- What does this mean? ParkMyCloud has analyzed your resource utilization patterns and automatically created an optimized on/off schedule that can save you money. You just need to apply it.
- You click the ParkMyCloud icon, which takes you to your ParkMyCloud recommendations screen to take action. You can click to accept the parking schedule as is, or modify it (including the option to be more conservative or more aggressive).
- You go back to check out your CloudHealth reports, which now include your ParkMyCloud savings data, broken down by environment, app, team, and more, for better visibility and cloud cost governance.
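ParkMyCloud’s recommendation algorithm isn’t described here. The following is only a rough illustration of the general idea. It is a minimal Python sketch that assumes you have the average CPU utilization for each hour of a typical week, and it marks hours below an idle threshold as candidates for parking. The function names, the 5% threshold, and the data layout are all hypothetical. A lower threshold yields a more conservative schedule, and a higher threshold yields a more aggressive one.

```python
from typing import Dict, List, Tuple

Slot = Tuple[int, int]  # (day of week 0=Mon..6=Sun, hour 0..23)


def recommend_parked_hours(util: Dict[Slot, float], idle_threshold: float = 5.0) -> List[Slot]:
    """Return the (day, hour) slots whose average utilization is below the threshold.

    These slots are candidates for being parked (turned off).
    """
    return sorted(slot for slot, pct in util.items() if pct < idle_threshold)


def parked_fraction(parked: List[Slot]) -> float:
    """Fraction of the 168-hour week the instance would be off."""
    return len(parked) / (7 * 24)


# Hypothetical usage pattern: busy 08:00-19:00 on weekdays, nearly idle otherwise.
week = {
    (d, h): 40.0 if d < 5 and 8 <= h < 19 else 1.0
    for d in range(7)
    for h in range(24)
}
parked = recommend_parked_hours(week)
print(len(parked), round(parked_fraction(parked), 2))  # 113 0.67
```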
The integration is especially exciting as it continues the momentum in the multi-cloud management space kicked off by last week’s news that VMware will acquire CloudHealth to provide multi-cloud operations at a global scale — congrats to the whole team.
Learn more about the ParkMyCloud/CloudHealth integration and partnership on this page. Interested in seeing a demo of this cloud cost governance solution? Schedule a demo here.
The next step on the cost optimization frontier for ParkMyCloud is cloud sizing. We have been working on product features around resource sizing that will deliver greater automation in the management of cloud infrastructure. A key part of this effort has involved analysis of cloud usage patterns across our entire user base, and we’ve identified some interesting patterns and correlations in cloud sizing and usage.
vCPU Utilization Patterns: Lower than Expected
One data point that caught our attention was vCPU metric data, specifically the very low average (and peak) utilization we see in our users’ infrastructure. We know anecdotally that a large proportion of what users manage in our platform consists of non-production instances used for development, staging, testing, and data analytics workloads, many of which do not need to run 24/7/365. But even bearing this in mind, vCPU utilization is surprisingly low. Based on our most recent analysis of instances across the four public cloud providers we support, some 50% of instances had an average vCPU utilization of only 2% and a peak of 55%. Even at the 75th percentile, average utilization was only 7%, albeit with a peak of 98%.
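We haven’t spelled out our exact methodology. One plausible way to produce percentile figures like these is sketched below in Python. It assumes you already have vCPU utilization samples for each instance, summarizes each instance by its average and peak, and takes a nearest-rank percentile across the fleet. The fleet data is hypothetical. The two percentiles are computed independently, so the average and the peak at a given percentile need not come from the same instance.

```python
import math
from statistics import mean
from typing import Dict, List, Tuple


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def utilization_percentiles(samples: Dict[str, List[float]], pct: float) -> Tuple[float, float]:
    """Return (percentile of per-instance averages, percentile of per-instance peaks).

    samples maps an instance ID to its vCPU utilization samples in percent.
    """
    averages = [mean(s) for s in samples.values()]
    peaks = [max(s) for s in samples.values()]
    return nearest_rank_percentile(averages, pct), nearest_rank_percentile(peaks, pct)


# Hypothetical fleet of four instances with a handful of samples each.
fleet = {
    "i-a": [1.0, 2.0, 3.0],
    "i-b": [0.0, 5.0, 10.0],
    "i-c": [10.0, 20.0, 30.0],
    "i-d": [50.0, 60.0, 70.0],
}
print(utilization_percentiles(fleet, 50))  # (5.0, 10.0)
print(utilization_percentiles(fleet, 75))  # (20.0, 30.0)
```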
What leads to these cloud sizing decisions?
Of course, when selecting instance sizes and types, vCPU is not the only consideration. To make an accurate assessment of the match between workload and instance type, there are several data points to consider, including memory, network, disk, etc. We have no visibility into the specific workloads on these instances and why they were chosen, but we can make some educated guesses about why this systematic overprovisioning of instances is occurring.
A few potential reasons include:
- A need to provision instances with more vCPUs in order to get the required memory
- A need to provision larger storage-optimized instances where the focus is on high data IOPS
- Using some other ‘rule of thumb’ when provisioning such as the not-so-tried-and-tested ‘determine what I think I need then double it’ rule.
Clearly, a number of factors drive the performance and cost of cloud instances (VMs), including the number of processor cores, the amount of RAM, storage capacity, and storage performance. Focusing on just one of these factors might not be very useful on its own, except that we observe such extreme underutilization of this particular component.
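The first reason above can be made concrete. Here is a small Python sketch, using a made-up catalog of instance types, that picks the cheapest type meeting both a vCPU requirement and a memory requirement. When a catalog scales vCPUs and memory together, meeting the memory requirement forces you to buy extra vCPUs that then sit idle. The type names, sizes, and prices are hypothetical. Real sizing would also weigh network, disk, and IOPS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gb: float
    hourly_price: float  # USD, hypothetical


def cheapest_fit(catalog: List[InstanceType], need_vcpus: int,
                 need_memory_gb: float) -> Optional[InstanceType]:
    """Cheapest type that satisfies BOTH the vCPU and memory requirement, or None."""
    candidates = [t for t in catalog
                  if t.vcpus >= need_vcpus and t.memory_gb >= need_memory_gb]
    return min(candidates, key=lambda t: t.hourly_price, default=None)


# Hypothetical catalog: names and prices are made up for illustration.
general = [
    InstanceType("general-2", 2, 8.0, 0.10),
    InstanceType("general-4", 4, 16.0, 0.20),
    InstanceType("general-8", 8, 32.0, 0.40),
]
highmem = [InstanceType("highmem-2", 2, 16.0, 0.13)]

# A workload needing 2 vCPUs and 14 GB of RAM:
print(cheapest_fit(general, 2, 14).name)            # general-4 (2 vCPUs sit idle)
print(cheapest_fit(general + highmem, 2, 14).name)  # highmem-2
```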
How much do cloud sizing choices matter?
Given the sheer volume of workloads moving to public cloud (some 80% of enterprises reported moving workloads to cloud in 2017), it is critical to accurately determine, monitor, and then optimize your compute resources. If you think improper cloud sizing is a problem in your environment, you may want to check out our recently published cloud waste checklist to identify other problem areas and take action to reduce costs.
There are many reasons why this “supersize me” approach to cloud sizing is occurring. We would be interested to get your take. How does your team determine compute requirements for cloud workloads? Are there other reasons why you might deliberately choose to oversize a resource? Comment below to let us know.
Alibaba Cloud is growing at an amazing rate, recently claiming to have overtaken both Google and IBM as the #3 public cloud provider globally, and certainly the #1 provider in China. Many sites and services hosted outside China are accessible from within China, but can suffer high latency and potentially lost functionality if their web interface requires interaction with blocked social media systems. As such, it is no surprise that a number of our (non-Chinese) customers have expressed interest in running Alibaba virtual machine instances in China. In this post, we outline the process… and offer an alternate plan.
General Process to Run Alibaba Instances in China
The steps to roll-out a deployment on Alibaba in mainland China are relatively clear:
- Establish a “legal commercial entity” in Mainland China.
- Select what services you want to run on Alibaba Cloud
- Apply for Internet Content Provider (ICP) certification
Each of these steps is described in more detail below.
Establish a Legal Commercial Entity
Or, putting it another way: you need to have an office in China. This can range from an actual office with your own employees, to a Joint Venture, which is a legal LLC between your organization and an established Chinese company. If your service is more informational in nature and is not actually selling anything via the service, then this can be relatively easy, taking only a couple of weeks (at least for the legal side), though you will still need to find a Joint Venture partner and make the deal worth their while financially. For commerce or trade-related services, the complexity, time requirements, and costs start going up significantly.
What to run on Alibaba Cloud
There is a decision point here, as there is one set of rules for Alibaba-hosted web/app servers, and additional rules for everything else. Base virtual machines, databases, and other such core IT building blocks require the ICP registration described below, plus “real-name registration”, where a passport is needed to confirm the identity of whoever is purchasing the resource. If all you need is a web server, you can skip the real-name registration. In either case, some of the filing requirements involve having a server and/or DNS record prepared in order to complete the later steps. A website does not need to be completely finished until launch, but a placeholder may be needed.
Internet Content Provider (ICP) certification
There are two flavors of ICP certification:
- A “simple” ICP Filing – which is the bare minimum needed for informational websites that are not directly generating revenue.
- ICP Commercial Filing – This starts with getting an approved ICP Filing, and also requires a Commercial License that must be obtained from a province/municipality in China. In some cases, this appears to be related to which Alibaba region you are using, and even the physical location of your public IP address.
Many references recommend finding an experienced consultant to guide you through these processes, and it is easy to see why!
OK…WAY too much work. What is Plan B?
The other way to run Alibaba instances in China is to host your site or services in Hong Kong. All of the rules described above apply to “Mainland China”, which does not include Hong Kong. Taiwan is also not included in Mainland China, but Hong Kong has the advantage of being better connected to the rest of China. If the main problem you are trying to solve is to reduce latency to your site for China-based customers, Hong Kong is the closest you can get without actually being there, and Alibaba appears to do a pretty good job optimizing the Hong Kong experience. No local office or legal filings required!
Once you are all set up: Optimize your Costs!
After your instances are set up, make sure you’re optimizing Alibaba costs. Our Mainland China-based customers using Alibaba have confirmed that ParkMyCloud is able to access the Alibaba APIs from our US-based servers – so you can go ahead and try it out.
Earlier this week we discussed ways to improve cloud automation through tagging. Today, I want to extend the conversation to look at how one ParkMyCloud user is applying tagging best practices to improve their cloud governance.
The company we talked to — they’re in media, so let’s call them MediaCorp — has about 10,000 employees, which means the Cloud Engineering team has several hundred cloud users to manage, with a combined 100+ AWS accounts and more than 5,000 active AWS resources. The only way they can maintain security and cost control in a cloud environment of this magnitude is through automated governance. Here’s how they do it.
Tagging Best Practices #1: Always Tag
MediaCorp has a strict policy: every AWS resource must have the same set of five tags attached to it:
- team — essential to establishing ownership of the resource, both for maintenance and for billing
- environment — knowing whether the resource is for production, staging, or QA has implications for on/off schedules
- application — MediaCorp uses this as a trigger for Chef Cookbooks, but can also apply to billing
- expiration date — Any non-production resource has a stated expiration date to prevent orphaned resources
- cost center — The finance department has internal billing codes for all IT resources
How does MediaCorp ensure that all resources are tagged?
Tagging Best Practices #2: Automated Compliance
The key is to use automated rules to enforce that every resource has the five required tags, and this is where ParkMyCloud’s policy engine comes into play. MediaCorp has configured policies that check for the five tags. If a resource is missing any of them, it is immediately put on an “always parked” schedule and moved to a team (a way to group instances in ParkMyCloud) specifically for mistagged resources.
When this happens, the Cloud Engineering team gets an email and a Slack notification, so they can track down the creator of the offending resource and correct the process that created it.
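ParkMyCloud’s policies are configured in its own product, so the code below is not its API. It is a minimal Python sketch of the same rule, assuming each resource is a plain record with a tag dictionary. If any of the five required tags is missing or blank, the sketch puts the resource on an always-parked schedule, moves it to a quarantine team, and sends a notification. The following are all hypothetical:

- the exact tag key spellings
- the schedule and team names
- the `notify` callback, which stands in for email and Slack

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed spellings for MediaCorp's five required tag keys.
REQUIRED_TAGS = ("team", "environment", "application", "expiration-date", "cost-center")
ALWAYS_PARKED = "always-parked"         # hypothetical schedule name
MISTAGGED_TEAM = "mistagged-resources"  # hypothetical team name


@dataclass
class Resource:
    resource_id: str
    tags: Dict[str, str]
    schedule: str = "default"
    team: str = "unassigned"


def missing_tags(resource: Resource) -> List[str]:
    """Required keys that are absent or have a blank value."""
    return [k for k in REQUIRED_TAGS if not resource.tags.get(k, "").strip()]


def enforce(resource: Resource, notify: Callable[[str], None]) -> bool:
    """Park and quarantine a non-compliant resource; return True if it was quarantined."""
    missing = missing_tags(resource)
    if not missing:
        return False
    resource.schedule = ALWAYS_PARKED
    resource.team = MISTAGGED_TEAM
    notify(f"{resource.resource_id} is missing tags: {', '.join(missing)}")
    return True


messages: List[str] = []  # stands in for email and Slack notifications
good = Resource("i-0aaa", {"team": "video", "environment": "dev",
                           "application": "transcoder",
                           "expiration-date": "2018-12-31", "cost-center": "CC-1234"})
bad = Resource("i-0bbb", {"team": "video", "environment": "dev"})
results = [enforce(r, messages.append) for r in (good, bad)]
print(results, bad.schedule, bad.team)
print(messages)
```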
Tagging Best Practices #3: Optimize Workflows
Now the tags themselves come into play. MediaCorp uses their five-tag system for three main purposes:
Configuration management: as mentioned above, they use tags as the trigger for Chef cookbooks, and the same approach works with Puppet modules or Ansible playbooks.
CI/CD: MediaCorp uses Jenkins to provision cloud resources, so they use tags to associate build and deployment servers with their corresponding repository and build number, for both automated and manual development tasks.
Cost control: the “environment” tag determines what parking schedule is applied to each resource. Production resources run 24×7, of course, while “dev” or “test” resources are put on a schedule to park 7:00 PM – 7:00 AM and on weekends. (Users can always log in to override these schedules if needed.)
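The cost-control rule above boils down to a function of the environment tag and the current time. Below is a minimal Python sketch of it. It assumes local time and exactly the hours stated: dev and test resources are parked from 7:00 PM to 7:00 AM and on weekends. The function name is hypothetical. Treating any environment other than dev or test as 24×7 is an assumption, and user overrides are not modeled.

```python
from datetime import datetime

PARKABLE_ENVIRONMENTS = {"dev", "test"}


def should_be_running(environment: str, now: datetime) -> bool:
    """Apply the environment-based schedule.

    Production (and, in this sketch, any unrecognized value) runs 24x7.
    Dev and test run 7:00 AM to 7:00 PM on weekdays and are parked
    overnight and on weekends.
    """
    env = environment.strip().lower()
    if env not in PARKABLE_ENVIRONMENTS:
        return True
    is_weekend = now.weekday() >= 5  # Saturday = 5, Sunday = 6
    return not is_weekend and 7 <= now.hour < 19


# January 1, 2018 was a Monday; January 6, 2018 was a Saturday.
print(should_be_running("dev", datetime(2018, 1, 1, 10)))        # True
print(should_be_running("dev", datetime(2018, 1, 1, 20)))        # False
print(should_be_running("test", datetime(2018, 1, 6, 10)))       # False
print(should_be_running("production", datetime(2018, 1, 6, 3)))  # True
```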
Conclusion: Tagging is Worth the Effort
It may at first seem unnecessarily harsh to automatically park any resource that doesn’t have proper tags applied, but this process is what allows MediaCorp to keep a well-governed, cost-controlled infrastructure. You can always adapt their approach to your own needs, for example by simply moving mistagged resources to another team and sending a notification that action is needed, without changing the resource’s state or schedule.
Either way, with a rigorous application of tagging best practices in place, you can automate governance and improve your workflows.
Since the beginning of public cloud, users have been attempting to improve cloud automation. This can be driven by laziness, scale, organizational mandate, or some combination of those. Since the rise of DevOps practices and principles, this “automate everything” approach has become even more popular, as it’s one of the main pillars of DevOps. One of the ways you can help sort, filter, and automate your cloud environment is to utilize tags on your cloud resources.
In the cloud infrastructure world, tags are labels or identifiers that are attached to your instances. This is a way for you to provide custom metadata to accompany the existing metadata, such as instance family and size, region, VPC, IP information, and more. Tags are created as key/value pairs, although the value is optional if you just want to use the key. For instance, your key could be “Department” with a value of “Finance”, or you could have a key of just “Finance”.
There are 4 general tag categories, as laid out in the best practices from AWS:
- Technical – This often includes things like the application that is running on the resource, what cluster it belongs to, or which environment it’s running in (such as “dev” or “staging”).
- Automation – These tags are read by automated software, and can include things like dates for when to decommission the resource, a flag for opting in or out of a service, or what version of a script or package to install.
- Business and billing – Companies with lots of resources need to track which department or user owns a resource for billing purposes, which customer an instance is serving, or some sort of tracking ID or internal asset management tag.
- Security – Tags can help with compliance and information security, as well as with access controls for users and roles who may be listing and accessing resources.
In general, more tags are better, even if you aren’t actively using those tags just yet. Planning ahead for ways you might search through or group instances and resources can help save headaches down the line. You should also ensure that you standardize your tags by being consistent with the capitalization/spelling and limiting the scope of both the keys and the values for those keys. Using management and provisioning tools like Terraform or Ansible can automate and maintain your tagging standards.
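Terraform and Ansible can help enforce tagging standards. Whatever tool you use, the underlying check looks something like the minimal Python sketch below. It canonicalizes key spelling and capitalization, and it limits values to an agreed set. The keys, aliases, and allowed values are hypothetical examples of a standard, not an AWS requirement.

```python
from typing import Dict, List, Optional, Set

# Hypothetical tagging standard: canonical keys, aliases, and allowed values.
ALLOWED_VALUES: Dict[str, Optional[Set[str]]] = {
    "environment": {"production", "staging", "dev", "test"},
    "team": None,          # None means any non-empty value is allowed
    "application": None,
}
ALIASES = {"env": "environment", "app": "application"}


def normalize_tags(tags: Dict[str, str]) -> Dict[str, str]:
    """Lowercase and hyphenate keys, map aliases, trim values,
    and lowercase values for keys with an enumerated value set."""
    normalized = {}
    for raw_key, raw_value in tags.items():
        key = raw_key.strip().lower().replace("_", "-").replace(" ", "-")
        key = ALIASES.get(key, key)
        value = raw_value.strip()
        if ALLOWED_VALUES.get(key) is not None:
            value = value.lower()
        normalized[key] = value
    return normalized


def violations(tags: Dict[str, str]) -> List[str]:
    """Problems that remain after normalization."""
    problems = []
    for key, value in normalize_tags(tags).items():
        if key not in ALLOWED_VALUES:
            problems.append(f"unknown key: {key}")
        elif not value:
            problems.append(f"empty value for {key}")
        elif ALLOWED_VALUES[key] is not None and value not in ALLOWED_VALUES[key]:
            problems.append(f"bad value for {key}: {value}")
    return problems


print(normalize_tags({"Env": " Dev ", "App": "transcoder"}))
# {'environment': 'dev', 'application': 'transcoder'}
print(violations({"ENV": "QA", "Owner": "video"}))
# ['bad value for environment: qa', 'unknown key: owner']
```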
Once you’ve got your tagging system implemented and your resources labelled properly, you can really dive into your cloud automation strategy. Many different automation tools can read these tags and utilize them, but here are a few ideas to help make your life better:
- Configuration Management – Tools like Chef, Puppet, Ansible, and Salt are often used for installing and configuring systems once they are provisioned. Tags can determine which settings to change or which configuration bundles to run on each instance.
- Cost Control – this is the automation area we focus on at ParkMyCloud – our platform’s automated policies can read the tags on servers, scale groups, and databases to determine which schedule to apply and which team to assign the resource to, among other actions.
- CI/CD – If your build tool (like Jenkins or Bamboo) is set to provision or utilize cloud resources for the build or deployment, you can use tags for the build number or code repository to help with the continuous integration or continuous delivery.
- Cloud Account Clean-up – Scripts and tools that help keep your account tidy can use tags that set an end date for the resource as a way to ensure that only necessary systems are around long-term. You can also take steps to automatically shut down or terminate instances that aren’t properly tagged, so you know your resources won’t be orphaned.
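Here is a minimal Python sketch of the account clean-up idea. It assumes each resource’s tags are available as a dictionary and that the end date is stored in ISO format under a hypothetical `expiration-date` key. It needs Python 3.7+ for `date.fromisoformat`. The sketch only lists candidates and deliberately does not call any stop or terminate API. In practice you would likely exempt production resources, which may not carry a real end date.

```python
from datetime import date
from typing import Dict, List, Tuple


def cleanup_candidates(resources: Dict[str, Dict[str, str]], today: date,
                       key: str = "expiration-date") -> List[Tuple[str, str]]:
    """List (resource ID, reason) pairs for resources that look ready for clean-up.

    A resource qualifies if its expiration tag holds an ISO date (YYYY-MM-DD)
    earlier than today, or if the tag is missing or unparseable.
    This only reports candidates; it does not stop or terminate anything.
    """
    candidates = []
    for resource_id, tags in sorted(resources.items()):
        try:
            expires = date.fromisoformat(tags.get(key, ""))
        except ValueError:
            candidates.append((resource_id, "missing or invalid expiration tag"))
            continue
        if expires < today:
            candidates.append((resource_id, f"expired on {expires.isoformat()}"))
    return candidates


inventory = {
    "i-01": {"expiration-date": "2018-01-15"},
    "i-02": {"expiration-date": "2018-03-01"},
    "i-03": {"environment": "dev"},
}
for resource_id, reason in cleanup_candidates(inventory, today=date(2018, 2, 1)):
    print(resource_id, reason)
```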
Conclusion: Tagging Will Improve Your Cloud Automation
As your cloud use grows, implementing cloud automation will be a crucial piece of your infrastructure management. Utilizing tags not only helps with human sorting and searching, but also with automated tasks and scripts. If you’re not already tagging your systems, a strategy for both tagging and automation can save you time and money.
Google Cloud Platform offers a range of machine types optimized to meet various needs. Machine types provide virtual hardware resources that vary by virtual CPU (vCPU), disk capability, and memory size, giving you a breadth of options. But with so much to choose from, finding the right Google Cloud machine type for your workload can get complicated.
In the spirit of our recent blog on EC2 instance types, we’re doing an overview of each Google Cloud machine type. This image shows the basics of what we will cover, but remember that you’ll want to investigate further to find the right machine type for your particular needs.
Predefined Machine Types
Predefined machine types are a fixed pool of resources managed by Google Compute Engine. They come in five “classes” or categories:
Standard machine types work well with workloads that require a balance of CPU and memory. The n1-standard family of machine types comes with 3.75 GB of memory per vCPU. There are 8 types in the series, ranging from 3.75 to 360 GB of memory and, correspondingly, from 1 to 96 vCPUs.
High memory machine types work for just what you’d think they would – tasks that require more system memory as opposed to vCPUs. The n1-highmem family comes with 6.50 GB of memory per vCPU, offering 7 total varieties ranging from 13 to 624 GB in memory, corresponding accordingly with 2 to 96 vCPUs.
If you’re looking for the most compute power relative to memory, the n1-highcpu series is the way to go, offering 0.90 GB of memory per vCPU. There are 7 options within the high-CPU machine type family, ranging from 1.80 to 86.4 GB of memory and from 2 to 96 vCPUs.
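The memory figures above follow directly from multiplying each vCPU count by the family’s memory-per-vCPU ratio. The Python sketch below reproduces them. It assumes the vCPU steps Google listed for the N1 families at the time of writing: 1 to 96 vCPUs for n1-standard, and 2 to 96 for the other two families. Check Google’s current documentation before relying on these steps.

```python
# Memory per vCPU (GB) and available vCPU counts for the predefined N1 families.
N1_FAMILIES = {
    "n1-standard": (3.75, [1, 2, 4, 8, 16, 32, 64, 96]),
    "n1-highmem": (6.50, [2, 4, 8, 16, 32, 64, 96]),
    "n1-highcpu": (0.90, [2, 4, 8, 16, 32, 64, 96]),
}


def machine_types():
    """Yield (name, vCPUs, memory in GB) for each predefined N1 machine type."""
    for family, (gb_per_vcpu, vcpu_counts) in N1_FAMILIES.items():
        for vcpus in vcpu_counts:
            yield f"{family}-{vcpus}", vcpus, round(vcpus * gb_per_vcpu, 2)


for name, vcpus, memory_gb in machine_types():
    print(f"{name:15} {vcpus:3} vCPUs {memory_gb:7.2f} GB")
```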
Shared-core machine types are cost-effective and work well with small or batch workloads that only need to run for a short time. They provide a single vCPU that runs on one hyper-thread of the host CPU running your instance.
The f1-micro machine type family provides bursts of physical CPU for brief periods of time in moments of need. They’re like spikes in compute power that can only happen in the event that your workload requires more CPU than you had allocated. These bursts are only possible periodically and are not permanent.
Memory Optimized (n1-ultramem or n1-megamem)
For more intense workloads that require high memory but also more vCPUs than you’d get with the high-memory machine types, memory-optimized machine types are ideal. With more than 14 GB of memory per vCPU, Google suggests that you choose memory-optimized machine types for in-memory databases and analytics, genomics analysis, SQL analysis services, and more. Availability of these machine types varies by region and zone.
Custom Machine Types
Predefined machine types vary to meet needs based on high memory, high vCPU, a balance of both, or both high memory and high vCPU. If that’s not enough to meet your needs, Google has one more option for you: custom machine types. With custom machine types, you can define exactly how many vCPUs and how much system memory the instance gets. They’re a great fit if your workloads don’t quite match up with any of the available predefined types, or if you need more compute power or more memory but don’t want to pay for extra capacity you don’t need that comes bundled with the next predefined size up.
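A custom type is identified by a name that encodes its shape. For example, `custom-6-23040` means 6 vCPUs and 22.5 GB of memory. The Python sketch below builds such a name. It also checks the request against the N1 custom machine type limits as we understand them at the time of writing:

- 1 vCPU or an even number of vCPUs up to 96
- memory in 256 MB increments
- 0.9 to 6.5 GB of memory per vCPU

Treat those limits as assumptions. Extended memory and newer machine families have different rules, so confirm against Google’s documentation.

```python
def custom_machine_type(vcpus: int, memory_gb: float) -> str:
    """Build an N1 custom machine type name (custom-CPUS-MEMORY_MB).

    Checks the assumed limits: 1 or an even number of vCPUs up to 96,
    memory in 256 MB increments, and 0.9-6.5 GB of memory per vCPU
    (extended memory is not modeled).
    """
    if not (vcpus == 1 or (vcpus % 2 == 0 and 2 <= vcpus <= 96)):
        raise ValueError("vCPU count must be 1 or an even number up to 96")
    memory_mb = round(memory_gb * 1024)
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    gb_per_vcpu = memory_mb / 1024 / vcpus
    if not 0.9 <= gb_per_vcpu <= 6.5:
        raise ValueError("memory must be between 0.9 and 6.5 GB per vCPU")
    return f"custom-{vcpus}-{memory_mb}"


print(custom_machine_type(6, 22.5))  # custom-6-23040
for bad_request in [(3, 6.0), (4, 30.0)]:
    try:
        custom_machine_type(*bad_request)
    except ValueError as err:
        print(bad_request, "->", err)
```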
About GPUs and machine types
On top of your virtual machine instances, Google also offers graphics processing units (GPUs) that can be used to boost workloads for processes like machine learning and data processing. GPUs typically can only be attached to predefined machine types, but in some cases can also be placed with custom machine types depending on zone availability. In general, the more GPUs you attach to an instance, the higher the maximum number of vCPUs and amount of system memory available to you.
What Google Cloud machine type should you use?
Between the predefined options and the ability to create custom Google Cloud machine types, Google offers enough variety for almost any application. Cost always matters, but with the new resource-based pricing structure, the specific machine type you choose matters less when it comes to pricing.
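To see why the machine type matters less, consider a simplified model in which a VM’s hourly cost is just its vCPU count and memory multiplied by per-unit rates. This Python sketch uses made-up rates and ignores sustained use and committed use discounts. It only shows that, under a purely per-resource model, the same total vCPUs and memory cost the same however they are split across machines.

```python
import math

# Hypothetical on-demand rates for illustration only; real prices vary by region.
VCPU_HOURLY_USD = 0.03
GB_HOURLY_USD = 0.004


def hourly_cost(vcpus: int, memory_gb: float) -> float:
    """Price a machine purely from its vCPU and memory footprint."""
    return vcpus * VCPU_HOURLY_USD + memory_gb * GB_HOURLY_USD


two_small = 2 * hourly_cost(2, 7.5)  # two machines with 2 vCPUs / 7.5 GB each
one_large = hourly_cost(4, 15.0)     # one machine with 4 vCPUs / 15 GB
print(math.isclose(two_small, one_large))  # True
```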
With good insight into your workload, usage trends, and business needs, you have the resources available to find the machine type that’s right for you.