Over the past year or so, we have spoken with quite a few prospective users who have defined their responsibilities as site reliability engineering (SRE). If, like me, you’re not familiar with the term, I’ll save you the Google search. SRE is a discipline that incorporates aspects of software engineering and applies them to IT operations problems. Practitioners aim to create ultra-scalable and highly reliable software systems. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.” Its origins can be traced back to Google in 2003, when Ben was hired to lead a team of software engineers running a production environment.
The site reliability engineering footprint at Google is now larger than 1,500 engineers. Many products have small to medium sized SRE teams supporting them, though not all products do. The SRE processes that have been honed over the years are being used by other, mainly large scale, companies that are also starting to implement this paradigm, including ServiceNow, Microsoft, Apple, Twitter, Facebook, Dropbox, Amazon, Target, IBM, Xero, Oracle, Zalando, Acquia, and GitHub.
The people we talk to on a daily basis are typically charged with operational management of their company’s cloud infrastructure, and thus governing and controlling costs (that’s where we come in). I got to wondering, how is this approached differently by, say, a site reliability engineer vs. someone who labels himself as “DevOps”?
How Does Site Reliability Engineering Compare to DevOps?
In simple terms, the difference between SREs and DevOps seems clear based on our conversations with folks. SREs are engineers focused on production environments, while DevOps is a philosophy as well as a role. DevOps folks are definitely less concerned with production vs. non-production, and more concerned with overall cloud management and operations. Side note: the term DevOps was coined around 2008, so the SRE role actually predates the DevOps engineer.
A site reliability engineer (SRE) will spend up to 50% of their time doing “ops” related work such as issues, on-call, and manual intervention. Since the software system that an SRE oversees is expected to be highly automatic and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling or automation. The ideal SRE candidate is a highly skilled system administrator with knowledge of code and automation.
When I first encountered it, site reliability engineering just seemed like another buzzword to replace “IT” or “Ops”. As I read more on it, I understand that it’s more about the people and the process and less about the technology. There is rarely a mention of the underlying infrastructure or tools, and it seems like the main requirement is just the desire to improve. With that, you can align your development and operations (funny, right – DevOps) around the discipline of SRE.
Should Your Company Implement a Site Reliability Engineering Approach?
So while all the hype is around implementing DevOps in your organization, should you really be adopting the idea of site reliability engineering? It certainly makes sense based on the name alone, as “site reliability” is synonymous with “business availability” in our modern internet-connected culture. Any downtime for your service or application means lost revenue and dissatisfied customers, which means the business takes a hit. Using site reliability engineering to keep things running smoothly, while employing DevOps principles to improve those smooth-running processes, seems to be the best combination to really empower your company.
ParkMyCloud just turned 3 years old, and from here, the future looks great. The market is growing, cloud is the norm, and cost control is always top of mind for companies big and small. In fact, over 600 enterprises in 25+ countries now use our platform to “park” idle cloud resources (including instances, databases and scale groups) in AWS, Azure, GCP and now Alibaba.
As we look to the future, we’re taking a moment to consider current cloud trends and how cost control needs are changing. To provide context, let’s take a quick look at where the market was three years ago.
The Problem that Got Us Started
When we founded the company three years ago, we set out to build a self-service, SaaS platform which would allow DevOps users to automate cloud cost control and integrate it into their cloud operations. We saw a need for this platform as we were talking to enterprises using AWS about broader cloud management needs as a service play. They wanted a self-service, purpose-built easy button for instance scheduling that could be centrally managed and governed but left up to the end user to control – enter ParkMyCloud.
Our value proposition started simply and has stayed relatively constant: save 20% on your cloud bill in 15 minutes or less (it’s 65% per parked resource). The ease of use, verifiable ROI, and richness of our platform capabilities allow global companies like McDonald’s, Unilever, Sysco, Sage and many others to adopt ParkMyCloud on their own, with no services, and begin to automate their cloud cost control in minutes – not days or weeks.
I went back and looked at our pre-launch pitch decks. At that time, the cloud Infrastructure-as-a-Service (IaaS) market was $10B or so and dominated by AWS, while others like Rackspace and HP were in the game with the other usual suspects. Today, Gartner estimates enterprises will spend $41B on IaaS in 2018. The market is still dominated by AWS, but the number of players is really down to 4 or 6 depending on where you want to put IBM and Oracle.
But the cloud waste problem is still prominent and growing. Most analysts and industry pundits estimate that 25% or more of your bill is wasted on unused, idle or overprovisioned resources. Based on 2018 IaaS predictions, that equates to $10B+ being wasted – that’s a BIG nut. In fact, if you break that down, that’s more than $1MM in wasted cloud spend every hour. And it’s important: most enterprises rank cloud security/governance and cost management as their primary concerns with cloud adoption.
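For anyone who wants to check that math, here is a minimal sketch of the breakdown. It assumes the figures quoted above (Gartner's $41B estimate and a flat 25% waste share); the function and constant names are ours, purely for illustration.

```python
# Assumed inputs, taken from the figures quoted in the text.
IAAS_SPEND_2018 = 41e9     # estimated 2018 IaaS spend, in dollars
WASTED_SHARE = 0.25        # "25% or more" of the bill, taken at its floor
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def annual_waste(total_spend: float, wasted_share: float) -> float:
    """Dollars wasted per year for a given spend and waste share."""
    return total_spend * wasted_share


def hourly_waste(annual_amount: float) -> float:
    """Spread an annual dollar amount evenly over the hours in a year."""
    return annual_amount / HOURS_PER_YEAR


waste = annual_waste(IAAS_SPEND_2018, WASTED_SHARE)
print(f"Annual waste: ${waste / 1e9:.2f}B")
print(f"Hourly waste: ${hourly_waste(waste) / 1e6:.2f}MM")
```

With these inputs the annual figure lands a little above $10B and the hourly figure a little above $1MM, which is where the round numbers come from.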
Cloud Trends Driving the Market
So how are things changing? We see three key trends that will drive our company and platform vision over the next 3 years:
- Multi-cloud – it’s been long discussed, but it’s now a reality: 20% of the enterprises using ParkMyCloud manage 2 or more cloud service providers (CSPs) in the platform, and that number is growing. As always, cost control is an important factor in a multi-cloud strategy.
- PaaS – Platform as a Service (PaaS) use is growing, so users are looking to optimize these resources. ParkMyCloud offers optimization for databases, scale groups, and logical groups. We plan to expand into containers and stacks to meet this need.
- Data-driven automation (AIOps) – our customers, large and small, are pushing us to expand our data-driven policies and automation – everyone is becoming more comfortable with the idea of automation. Our first priority on this front is to optimize overprovisioned resources – often referred to as RightSizing … RightSizeMyCloud!
Cloud trends are not always easy to predict, but one thing is for certain: costs will need to be controlled. Good fun ahead.
We’re happy to announce that ParkMyCloud now supports Alibaba Cloud!
Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) users have saved millions of dollars on their cloud bills using ParkMyCloud’s automated cloud cost optimization platform. Customers like McDonald’s, Sysco, and Unilever use ParkMyCloud to automatically turn off idle cloud resources as part of their DevOps process.
Now, Alibaba Cloud customers can do the same.
Alibaba Cloud is experiencing rapid customer adoption and growth – in the 4th quarter of last year, it saw over 100% growth, with more than 300 products and features launched. The company is clearly expanding its horizons beyond retail and putting a focus on innovation and development in the cloud space – both in China, where its core customer base is located, and throughout the world as companies globally choose Alibaba as their primary cloud provider or as part of a multi-cloud strategy.
But the real reason we’re here is to help cloud users solve the enormous problem of cloud waste.
We estimate that Alibaba users will waste $552 million on idle cloud resources this year – that’s $1.5 million per day that could easily be saved with automated cost optimization in place. There’s no time to lose in getting cost control measures in place.
See It in Action
Get a preview of ParkMyCloud – watch this 2-minute demo to see how it works. To see a full demo and get your questions answered, schedule a personalized demo now.
Try Now for Free
You can get started with Alibaba Cloud cost control now with a free 14-day trial of ParkMyCloud, with full access to premium features.
After your trial expires, you can choose to continue using the free tier, or upgrade to use premium features such as SmartParking, full API access, advanced reporting and SSO.
Cheers, and happy parking.
We have been talking about idle cloud resources for several years now. Typically, we’re talking about instances purchased On Demand that you’re using for non-production purposes like development, testing, QA, staging, etc. These resources can be “parked” when they’re not being used, such as on nights and weekends, saving 65% or more per resource each month. What we haven’t talked much about is how the problem of idle cloud resources extends beyond just your typical virtual machine.
Why Idle Cloud Resources are a Problem
If you think about it, the problem is pretty straightforward: if a resource is idle, you’re paying your cloud provider for something you’re not actually using. This adds up.
Most non-production resources can be parked about 65% of the time, that is, parked 12 hours per day on weekdays and all day on weekends. (This is confirmed by looking at the resources parked in ParkMyCloud: they’re scheduled to be off just under 65% of the time.) We see that our customers are paying their cloud providers an average list price of $220 per month for their instances. If you’re currently paying $220 per month for an instance and leaving it running all the time, that means you’re wasting $143 per instance per month.
Maybe that doesn’t sound like much. But if that’s the case for 10 instances, you’re wasting $1,430 per month. One hundred instances? You’re up to a bill of $14,300 for time you’re not using. And that’s just a simple micro example. At a macro level that’s literally billions of dollars in wasted cloud spend.
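The numbers above are easy to reproduce. Here is a rough sketch, assuming a nights-and-weekends schedule (off 12 hours on each weekday and all day on both weekend days) and the $220 average list price; the helper names are made up for this example and are not part of any ParkMyCloud API.

```python
HOURS_PER_WEEK = 7 * 24


def parked_fraction(weekday_hours_off: float = 12.0, weekend_days: int = 2) -> float:
    """Fraction of the week a resource is off under a nights-and-weekends schedule."""
    weekdays = 7 - weekend_days
    hours_off = weekdays * weekday_hours_off + weekend_days * 24
    return hours_off / HOURS_PER_WEEK


def monthly_waste(list_price: float, idle_fraction: float, instances: int = 1) -> float:
    """Dollars per month spent on time the instances sit idle."""
    return list_price * idle_fraction * instances


# 60 weekday hours + 48 weekend hours = 108 of 168 hours, just under 65%.
print(f"Schedule is off {parked_fraction():.1%} of the week")

# Using the rounded 65% figure and the $220 average list price from the text.
for count in (1, 10, 100):
    print(f"{count:>3} instances: ${monthly_waste(220, 0.65, count):,.2f} wasted per month")
```

Swap in your own list price and schedule to get a quick estimate for your environment.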
4 Types of Idle Cloud Resources
So what kinds of resources are typically left idle, consuming your budget? Let’s dig into that, looking at the big three cloud providers — Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
- On Demand Instances/VMs – this is the core of the conversation, and what we’ve addressed above. On demand resources – and their associated scale groups – are frequently left running when they’re not being used, especially those used for non-production purposes.
- Relational Databases – there’s no doubt that databases are frequently left running when not needed as well, in similar circumstances to the On Demand resources. The question is whether you can park them to cut back on wasted spend. AWS allows you to park certain types of its RDS service; however, you cannot park comparable idle database services in Azure (SQL Database) or GCP (SQL). In this case, you should review your database infrastructure regularly and terminate anything unnecessary – or change to a smaller size if possible.
- Load Balancers – AWS Elastic Load Balancers (ELB) cannot be stopped (or parked), so to avoid getting billed for idle time you need to remove them. The same can be said for Azure Load Balancer and GCP Load Balancers. Alerts can be set up in CloudWatch/Azure Metrics/Google Stackdriver when you have a load balancer with no instances, so be sure to make use of those alerts.
- Containers – optimizing container use is a project of its own, but there’s no doubt that container services can be a source of waste. In fact, we are evaluating the ability for ParkMyCloud to park container services including ECS and EKS from AWS, ACS and AKS from Azure, and GKE from GCP, and the ability to prune and park the underlying hosts. In the meantime, you’ll want to regularly review the usage of your containers and the utilization of the infrastructure, especially in non-production environments.
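As a rough illustration of the "park what can be stopped, remove what can't" review for load balancers, here is a small sketch that flags load balancers with no registered instances. It works on a hand-built inventory list and does not call any cloud provider API; the `LoadBalancer` class, its fields, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancer:
    """Hypothetical inventory record; in practice you would export this from your provider."""
    name: str
    provider: str
    registered_instances: int


def find_idle_load_balancers(inventory: list[LoadBalancer]) -> list[LoadBalancer]:
    """Return load balancers with nothing behind them (candidates for removal)."""
    return [lb for lb in inventory if lb.registered_instances == 0]


# Sample data, invented for this example.
inventory = [
    LoadBalancer("prod-web", "aws", 6),
    LoadBalancer("dev-web", "aws", 0),
    LoadBalancer("staging-api", "azure", 2),
]

idle = find_idle_load_balancers(inventory)
for lb in idle:
    print(f"{lb.provider}: {lb.name} has no instances and is still being billed")
```

The same filter-and-review pattern applies to any resource type that cannot be stopped, only deleted.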
Cloud waste is a billion-dollar problem facing businesses today. Make sure you’re turning off idle cloud resources in your environment, by parking those that can be stopped and eliminating those that can’t, to do your part in optimizing cloud spend.
The time is ripe to take a fresh look at the advantages of multi-cloud. In the past 12 months, we’ve seen a huge increase in the number of our customers who use multiple public clouds: more than 20% of them now do. With this trend in mind, we wanted to take a look at the positives of a multi-cloud strategy as well as the risks – because of course there’s no “easy button.”
What is Multi-Cloud?
First off, let’s define multi-cloud. Clearly, we’re talking about using more than one cloud, but clouds come in different flavors. For example, multi-cloud incorporates the idea of hybrid cloud – a mix of public and private clouds. But multi-cloud can also mean two or more public clouds or two or more private clouds.
According to the RightScale 2018 State of the Cloud Report, 81% of enterprises have a multi-cloud strategy.
What are the advantages of multi-cloud?
So why are businesses heading this direction with their infrastructure? Simple reasons include the following:
- Risk Mitigation – create resilient architectures
- Managing vendor lock-in – get price protection
- Optimization – place your workloads to optimize for cost and performance
- Cloud providers’ unique capabilities – take advantage of offerings in AI, IoT, Machine Learning, and more
When I asked our CTO what he sees as the advantages of a multi-cloud strategy, he highlighted risk management. ParkMyCloud’s own platform was born in the cloud: we run on AWS with a multi-region architecture and redundancy (let’s call this multi-cloud ‘light’), and if we went multi-cloud, we would leverage another public cloud for risk mitigation.
Specifically, he means risk management from the perspective of one vendor having an infrastructure meltdown or attack. AWS had an issue about 15 months ago when S3 was offline in the US-East-1 region for 5+ hours, affecting many companies, large and small, and software from web apps to smartphone apps (including ours). There have also been DDoS attacks on certain AWS regions that affected service availability.
Having a backup on another cloud service provider (CSP) or private cloud in these cases could have ensured 100% uptime. Alibaba and other cloud vendors may also have a much stronger footing in certain geographic regions because they have operated there for a long time. When a vendor is just getting a toehold in a region, its environment there has minimal redundancy and few of the safeguards that provide the desired high availability, so another provider in the same region may be safer from an availability perspective.
Do the advantages of multi-cloud outweigh the challenges?
Now let’s say you want to go multi-cloud. What does this mean for you? From our own experience integrating with AWS, Azure, and Google Cloud, we’ve seen that each cloud has its own set of interfaces and its own challenges. It is not a “write once, run everywhere” situation between the vendors, and any cloud or network management utility system needs to do the work to provide deep integration with each CSP.
Further, the nuances of configuring and managing each CSP require both broad and deep knowledge, and it is rare to find employees with the essential expertise for multiple clouds – so more staff is needed to manage multi-cloud with confidence that it is being done in a way that is both secure and highly available. With everyone trying to play catch-up with AWS, and with AWS itself evolving at a breakneck pace, it is very difficult for an individual or organization to best utilize one CSP, let alone multiple clouds.
Things like a common container environment can help mitigate these issues somewhat by isolating engineers from the nuances of virtual machine management, but the issues of network, infrastructure, cost optimization, security, and availability remain very CSP-specific.
On paper there are advantages of having a multi-cloud strategy. In practice, like many things, it ain’t easy.