8 Ways to Improve Cloud Automation Through Tagging

Since the beginning of public cloud, users have been attempting to improve cloud automation. This can be driven by laziness, scale, organizational mandate, or some combination of those. Since the rise of DevOps practices and principles, this “automate everything” approach has become even more popular, as it’s one of the main pillars of DevOps. One of the ways you can help sort, filter, and automate your cloud environment is to utilize tags on your cloud resources.

Tagging Methodologies

In the cloud infrastructure world, tags are labels or identifiers attached to your instances and other resources. They let you add custom metadata alongside the built-in metadata, such as instance family and size, region, VPC, and IP information. Tags are created as key/value pairs, although the value is optional if you just want to use the key. For instance, your key could be “Department” with a value of “Finance”, or you could use a standalone key of “Finance”.
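
As a quick illustration, here’s how you might attach those example tags with Python and boto3. The instance ID is a placeholder, and the second tag shows a key used with an empty value:

```python
import boto3

# Hypothetical instance ID -- replace with one of your own resources.
INSTANCE_ID = "i-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name="us-east-1")

# Tags are key/value pairs; an empty string is a valid value when you
# only care about the key itself.
ec2.create_tags(
    Resources=[INSTANCE_ID],
    Tags=[
        {"Key": "Department", "Value": "Finance"},
        {"Key": "Finance", "Value": ""},
    ],
)
```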

There are four general tag categories, as laid out in AWS’s tagging best practices:

  1. Technical – This often includes things like the application that is running on the resource, what cluster it belongs to, or which environment it’s running in (such as “dev” or “staging”).
  2. Automation – These tags are read by automated software, and can include things like dates for when to decommission the resource, a flag for opting in or out of a service, or what version of a script or package to install.
  3. Business and billing – Companies with lots of resources need to track which department or user owns a resource for billing purposes, which customer an instance is serving, or some sort of tracking ID or internal asset management tag.
  4. Security – Tags can help with compliance and information security, as well as with access controls for users and roles who may be listing and accessing resources.

In general, more tags are better, even if you aren’t actively using those tags just yet. Planning ahead for ways you might search through or group instances and resources can save headaches down the line. You should also standardize your tags: be consistent with capitalization and spelling, and limit the scope of both the keys and the values for those keys. Management and provisioning tools like Terraform or Ansible can help you apply and maintain your tagging standards automatically.
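
If a full provisioning tool is overkill, even a small audit script can enforce a standard. Here’s a minimal sketch using Python and boto3; the keys and allowed values are invented for illustration:

```python
import boto3

# An illustrative tagging standard -- your organization's keys and
# allowed values will differ.
TAG_STANDARD = {
    "Department": {"Finance", "Engineering", "Marketing"},
    "Environment": {"dev", "staging", "prod"},
}

ec2 = boto3.client("ec2")

# Walk every instance and report tags that break the standard.
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        for key, allowed in TAG_STANDARD.items():
            if key not in tags:
                print(f"{instance_id}: missing required tag '{key}'")
            elif tags[key] not in allowed:
                print(f"{instance_id}: value '{tags[key]}' not allowed for '{key}'")
```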

Automation Methodologies

Once you’ve got your tagging system implemented and your resources labeled properly, you can really dive into your cloud automation strategy. Many different automation tools can read these tags and utilize them, but here are a few ideas to help make your life better:

  1. Configuration Management – Tools like Chef, Puppet, Ansible, and Salt are often used for installing and configuring systems once they are provisioned. Tags can determine which settings to change or which configuration bundles to run on each instance.
  2. Cost Control – This is the automation area we focus on at ParkMyCloud: our platform’s automated policies can read the tags on servers, scale groups, and databases to determine which schedule to apply and which team to assign the resource to, among other actions.
  3. CI/CD – If your build tool (like Jenkins or Bamboo) is set to provision or utilize cloud resources for the build or deployment, you can use tags for the build number or code repository to help with the continuous integration or continuous delivery.
  4. Cloud Account Clean-up – Scripts and tools that keep your account tidy can use tags that set an end date for a resource, ensuring that only necessary systems stick around long-term. You can also automatically shut down or terminate instances that aren’t properly tagged, so you know your resources won’t be orphaned. A sketch of this idea follows this list.
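
As a sketch of that last idea, here’s roughly what an end-of-life sweep might look like in Python with boto3. The DecommissionDate tag name and ISO date format are assumptions rather than a standard, and a real script would likely warn owners before stopping anything:

```python
from datetime import date

import boto3

# Assumed tag name and ISO date format (e.g. 2018-12-31) -- substitute
# whatever your tagging standard actually defines.
END_DATE_TAG = "DecommissionDate"

ec2 = boto3.client("ec2")
today = date.today()

running = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)

for reservation in running["Reservations"]:
    for instance in reservation["Instances"]:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        end_date = tags.get(END_DATE_TAG)
        # Stop anything whose end date has passed.
        if end_date and date.fromisoformat(end_date) < today:
            print(f"Stopping {instance['InstanceId']} (end date {end_date})")
            ec2.stop_instances(InstanceIds=[instance["InstanceId"]])
```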

Conclusion: Tagging Will Improve Your Cloud Automation

As your cloud use grows, implementing cloud automation will be a crucial piece of your infrastructure management. Utilizing tags not only helps with human sorting and searching, but also with automated tasks and scripts. If you’re not already tagging your systems, having a strategy on the tagging and the automation can save you both time and money.

Google Cloud Machine Types Comparison

Google Cloud Platform offers a range of machine types optimized to meet various needs. Machine types provide virtual hardware resources that vary by virtual CPU (vCPU), disk capability, and memory size, giving you a breadth of options. But with so much to choose from, finding the right Google Cloud machine type for your workload can get complicated.

In the spirit of our recent blog on EC2 instance types, we’re doing an overview of each Google Cloud machine type. Below are the basics of what we’ll cover, but remember that you’ll want to investigate further to find the right machine type for your particular needs.

Predefined Machine Types

Predefined machine types are a fixed pool of resources managed by Google Compute Engine. They come in five “classes” or categories:

Standard (n1-standard)

Standard machine types work well for workloads that require a balance of CPU and memory. The n1-standard family comes with 3.75 GB of memory per vCPU. There are eight sizes in the series, ranging from 3.75 GB to 360 GB of memory with 1 to 96 vCPUs.

High-Memory (n1-highmem)

High-memory machine types work for just what you’d think – tasks that require more system memory relative to vCPUs. The n1-highmem family comes with 6.50 GB of memory per vCPU, offering seven sizes ranging from 13 GB to 624 GB of memory with 2 to 96 vCPUs.

High-CPU (n1-highcpu)

If you’re looking for the most compute power, the n1-highcpu series is the way to go, offering 0.90 GB of memory per vCPU. There are seven options in the high-CPU family, ranging from 1.80 GB to 86.4 GB of memory with 2 to 96 vCPUs.
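
Since each n1 family is defined by its memory-per-vCPU ratio, the totals above fall straight out of a bit of arithmetic:

```python
# GB of memory per vCPU for each predefined n1 family.
GB_PER_VCPU = {
    "n1-standard": 3.75,
    "n1-highmem": 6.50,
    "n1-highcpu": 0.90,
}

def machine_memory_gb(family: str, vcpus: int) -> float:
    """Total memory for a predefined n1 machine type."""
    return GB_PER_VCPU[family] * vcpus

# The largest size in each family has 96 vCPUs:
for family in GB_PER_VCPU:
    print(f"{family}-96: {machine_memory_gb(family, 96):.2f} GB")
# n1-standard-96: 360.00 GB
# n1-highmem-96:  624.00 GB
# n1-highcpu-96:   86.40 GB
```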

Shared-Core (f1-micro)

Shared-core machine types are cost-effective and work well with small or batch workloads that only need to run for a short time. They provide a single vCPU that runs on one hyper-thread of the host CPU running your instance.

The f1-micro machine type provides bursts of physical CPU for brief periods when your workload requires more CPU than its baseline allocation. These bursts are only possible periodically and don’t persist.

Memory Optimized (n1-ultramem or n1-megamem)

For more intense workloads that require high memory but also more vCPUs than you’d get with the high-memory machine types, memory-optimized machine types are ideal. With more than 14 GB of memory per vCPU, Google suggests memory-optimized machine types for in-memory databases and analytics, genomics analysis, SQL analysis services, and more. Availability of these machine types varies by zone and region.

Custom Machine Types

Predefined machine types vary to meet needs based on high memory, high vCPU, a balance of both, or both high memory and high vCPU. If that’s not enough to meet your needs, Google has one more option for you – custom machine types. With custom machine types, you define exactly how many vCPUs and how much system memory the instance gets. They’re a great fit if your workloads don’t quite match up with any of the available predefined types, or if you need more compute power or more memory but don’t want to pay for the extra resources bundled into the larger predefined types.
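
Custom machine types still follow a few published rules (at the time of writing: the vCPU count must be 1 or an even number, memory comes in 256 MB increments, and memory per vCPU must fall between 0.9 and 6.5 GB unless you use extended memory). A quick sketch of that validation in Python:

```python
def is_valid_custom_type(vcpus: int, memory_mb: int) -> bool:
    """Check an n1 custom machine type spec against Google's published
    rules (as of this writing; extended memory relaxes the upper bound)."""
    if vcpus != 1 and vcpus % 2 != 0:
        return False  # vCPU count must be 1 or an even number
    if memory_mb % 256 != 0:
        return False  # memory is allocated in 256 MB increments
    gb_per_vcpu = (memory_mb / 1024) / vcpus
    return 0.9 <= gb_per_vcpu <= 6.5  # allowed memory-to-vCPU range

print(is_valid_custom_type(4, 8 * 1024))   # True: 2.0 GB per vCPU
print(is_valid_custom_type(3, 8 * 1024))   # False: odd vCPU count
print(is_valid_custom_type(2, 16 * 1024))  # False: 8.0 GB per vCPU is too high
```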

About GPUs and machine types

On top of your virtual machine instances, Google also offers graphics processing units (GPUs) that can be used to boost workloads for processes like machine learning and data processing. GPUs typically can only be attached to predefined machine types, but in some cases can also be placed with custom machine types depending on zone availability. In general, the more GPUs you attach to an instance, the more vCPUs and system memory are available to you.

What Google Cloud machine type should you use?

Between the predefined options and the ability to create custom Google Cloud machine types, Google offers enough variety for almost any application. Cost matters, but with the new resource-based pricing structure, the actual machine you choose matters less when it comes to pricing.

With good insight into your workload, usage trends, and business needs, you have the resources available to find the machine type that’s right for you.

Should Your Company Adopt Google’s Site Reliability Engineering Approach?

Over the past year or so, we have spoken with quite a few prospective users who have defined their responsibilities as site reliability engineering (SRE). If, like me, you’re not familiar with the term, I’ll save you the Google search. SRE is a discipline that incorporates aspects of software engineering and applies them to IT operations problems. Practitioners aim to create ultra-scalable and highly reliable software systems. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.” The discipline traces back to 2003, when Google hired Treynor to lead software engineers in running a production environment.

The site reliability engineering footprint at Google is now larger than 1,500 engineers. Many products have small- to medium-sized SRE teams supporting them, though not all do. The SRE processes Google has honed over the years are now being adopted by other, mainly large-scale, companies, including ServiceNow, Microsoft, Apple, Twitter, Facebook, Dropbox, Amazon, Target, IBM, Xero, Oracle, Zalando, Acquia, and GitHub.

The people we talk to on a daily basis are typically charged with operational management of their company’s cloud infrastructure, and thus with governing and controlling costs (that’s where we come in). I got to wondering: how does a site reliability engineer approach this differently from, say, someone who identifies as “DevOps”?

How Does Site Reliability Engineering Compare to DevOps?

In simple terms, based on our conversations, the difference seems clear: SREs are engineers focused on production environments, while DevOps is a philosophy as well as a role. DevOps folks are less concerned with production vs. non-production, and more concerned with overall cloud management and operations. Side note: the term DevOps was coined around 2008, so the SRE role actually predates the DevOps engineer.

A site reliability engineer (SRE) will spend up to 50% of their time on “ops”-related work such as handling issues, being on call, and performing manual intervention. Since the software systems an SRE oversees are expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling, or automation. The ideal SRE candidate is a highly skilled system administrator with knowledge of code and automation.

When I first encountered it, site reliability engineering just seemed like another buzzword to replace “IT” or “Ops”. As I read more about it, I came to understand that it’s more about the people and the process, and less about the technology. There is rarely a mention of the underlying infrastructure or tools; the main requirement seems to be the desire to improve. With that, you can align your development and operations (funny, right – DevOps) around the discipline of SRE.

Should Your Company Implement a Site Reliability Engineering Approach?

So while all the hype is around implementing DevOps in your organization, should you really be adopting the idea of site reliability engineering? It certainly makes sense based on the name alone, as “site reliability” is synonymous with “business availability” in our modern internet-connected culture. Any downtime for your service or application means lost revenue and dissatisfied customers, which means the business takes a hit. Using site reliability engineering to keep things running smoothly, while employing DevOps principles to improve those smooth-running processes, seems to be the best combination to really empower your company.

Google Cloud Discounts Just Got Better For You with Resource-Based Pricing

Among the many announcements made at Google Cloud Next last week was a new option for Google Cloud discounts: resource-based pricing.

This new option, which Google will roll out in the fall, expands its idea of “pay-per-use” pricing. For the n1-standard, n1-highmem, and n1-highcpu machine types, Google will no longer charge based on the machine type. Instead, it will aggregate across resources and charge based on the quantity of vCPUs and GB of memory you use.

This new addition to the family of Google Cloud discounts will have its biggest effect on Sustained Use discounts.

Sustained Use Discounts are Even Better

We recently asked the question, do sustained use discounts really save you money? The answer was yes, although depending on the situation, perhaps not the maximum amount of money possible.

With the resource-based pricing change, Sustained Use Discounts will be calculated across regions instead of individual zones, so you can rack up “percentage of the month” usage, and therefore discounts, faster and more easily. For example, if you have a single busy week in the month during which you run several VMs with varying amounts of vCPU, all of that vCPU usage will be counted together before the sustained use amount is calculated, giving you the potential for a better-optimized discount.
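
To see why aggregating usage helps, here’s a back-of-the-envelope sketch of how sustained use tiers work for n1 machine types: each successive quarter of the month is billed at 100%, 80%, 60%, and then 40% of the base rate. Treat this as an approximation of the published model, not Google’s exact billing logic:

```python
# Incremental rates for each successive quarter of the month, as
# published for n1 machine types: later hours get cheaper.
TIER_RATES = [1.00, 0.80, 0.60, 0.40]

def effective_cost_fraction(usage_fraction: float) -> float:
    """Fraction of the full on-demand monthly price paid for running
    a vCPU for `usage_fraction` (0.0 to 1.0) of the month."""
    paid = 0.0
    remaining = usage_fraction
    for rate in TIER_RATES:
        tier = min(remaining, 0.25)
        paid += tier * rate
        remaining -= tier
    return paid

print(effective_cost_fraction(1.0))        # ~0.70: a full month earns a 30% discount
print(effective_cost_fraction(0.5) / 0.5)  # ~0.90: half a month is billed at 90% of pro-rata
```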

For some customers, the biggest impact of this change will be in Autoscaling Managed Instance Groups. In the old system, if a group of instances scaled up and down over time (especially daily), the new instances that were created and then shut down a short time later never had an opportunity to accumulate enough hours to reach a sustained use discount tier. In the new system, the aggregated use of these systems counts toward the sustained use, giving a much higher likelihood of getting the Sustained Use Discount.

Billing Simplicity (…Hopefully)

While this should make your bill lower, it may not make your bill “easier to understand” as Google claims. Since discounts will apply at a regional level, and there’s now yet another step going on behind the scenes to calculate your bill, some users may find it harder to predict their monthly bills. You will no longer be able to see the machine types that you are using in your invoice, although you can obtain them via Billing BigQuery. Keep this in mind, and be sure to dig into your first few invoices after the change is made to see how it’s affecting your particular environment.

It’s All About Automation

One thing we appreciate about the change is that Google Cloud customers do not need to take any action to receive these discounts – it’s all done automatically. The same has always been true for Sustained Use Discounts, something that makes Google Cloud stand out from its immediate competition – neither AWS nor Azure offers something directly equivalent.

Google Cloud Discounts are Good for the Customer

Here’s what people are saying about the update: it shows flexibility as a priority, it’s pro-customer, and AWS would be wise to do the same.

We’re glad to see another addition to the Google Cloud discounts that goes directly toward improving the customer experience. It’s clear that GCP is focusing on a customer-first experience – which is good news for all of us.
