Have you been hearing a lot about Azure Databricks lately? We have. One of the nice things about talking with ParkMyCloud users is that we get to see trends often before they are more widely recognized within the industry. Whether it is adoption of new instances or databases, or usage of new tools and services it’s always interesting to see change occur.
What is Databricks?
One such change over the last year or so has been an enormous increase in the use of very short-lived instances, typically less than 60 minutes, which get spun up as part of clusters. These are in fact Databricks being used to undertake data analytics workloads. I had come across Databricks in relation to their unicorn status in the startup world – as of six months ago were valued at close to $4B – so I guess it was only a matter of time before we began to see the fruits of their labor become popular.
The Databricks story is an interesting one which begins at UC Berkeley with the development of a research project, Apache Spark in 2009. Apache Spark is described as a unified analytics engine for large-scale data processing. It provides an extremely rapid cluster computing technology, designed for fast computation. The team who developed Spark went on to found Databricks in 2013 since which time they have raised $500MM in funding.
The Databricks platform allows enterprises to build their data pipelines across data storage systems and prepare data sets for data scientists and engineers. To do this, Databricks offers a range of tools for building, managing and monitoring data pipelines. It enables the building of machine learning (ML) models, which have grown in parallel with the growth in big data within the enterprise.
The product also has an interesting approach to pricing with the introduction of their own usage-based billing methodology based on DBU’s. A DBU is a Databricks Unit (DBU) which is a unit of processing capability per hour, billed on per-second usage. This cost excludes the cost of the underlying instance (VM). The good thing is that the model is very transparent and provides a number of pricing options and tiers. Based on the tier and type of service required prices range from $0.07/DBU for their Standard product on the Data Engineering Light tier to $0.55 for the Premium product on the Data Analytics tier. Helpfully, they do offer online calculators for both Azure and AWS to help estimate cost including underlying infrastructure. The Azure Databricks pricing example can be seen here.
Databricks + Microsoft = Azure Databricks
A major breakthrough for the company was a unique partnership with Microsoft whereby their product is not just another item in the MS Azure Marketplace but rather is fully integrated into Azure with the ability to spin up Azure Databricks in the same way you would a virtual machine. Once running, the service can scale automatically as the users need change in the same way cloud is able to scale using autoscaling groups to match supply against demand.
Databricks are also available for other public cloud vendors, most notably AWS (available within the Marketplace). However, the level of integration is not the same as on Azure, and the service looks much more like a standard AWS marketplace offering.
Why More and More Companies are Using Azure Databricks
What is clear is that opportunities for use of ML and AI has progressed from experimentation to workloads, and these workloads are now at a massive scale. This has also been accompanied by the emergence of a new subset of DevOps called AIOps, which makes a lot of sense given the amount of infrastructure and services now needing to be configured and deployed to run such workloads.
In a forthcoming blog we will dig a little deeper in terms of the usage patterns for such workloads and the changes in terms of the way organizations running these workloads are now utilizing the public cloud for these non-production workloads.
AWS Firecracker was announced at AWS re:Invent in November 2018 as a new AWS open source virtualization technology. The technology is purpose-built for creating and managing secure, multi-tenant container and function-based services. It was described by the AWS Chief Evangelist Jeff Barr as “what a virtual machine would look like if it was designed for today’s world of containers and functions.”
What is AWS Firecracker?
Firecracker is a Virtual Machine Manager (VMM) exclusively designed for running transient and short-lived processes. In other words, it helps to optimize the running of functions and serverless workloads. It’s also an important new component in the emerging world of serverless technologies and is used to enhance the backend implementation of Lambda and Fargate. Firecracker helps deliver the speed of containers combined with the security of VMs. If you use Lambda or Fargate, you’re already receiving the benefits of Firecracker. However, if you run/orchestrate a large volume of containers, you should take a look at this service with optimization in mind.
How AWS Firecracker Creates Efficiencies
AWS can realize the economic benefits of Firecracker by creating what they call “microVMs”, which allows them to spread serverless workloads around multiple servers thus getting a greater ROI from its investment in the servers behind serverless. In terms of customer benefit, using Firecracker enables these new microVMs to launch in 125 milliseconds or less, compared to the seconds (or longer) it can take to launch a container or spin up a traditional virtual machine. In a world where thousands of VMs can be spun up and down to tackle a specific workload, this will constitute a significant savings. And remember, these are fully fledged micro virtual machines, not just containers.The micro VM’s themselves are worth a closer look as each includes an in-process rate limiter to optimize shared network and storage resources. As a result, one server can support thousands of microVMs with widely varying processor and memory configurations.\
There is also the enhanced security and workload isolation only available from Kernel-based Virtual Machine (KVMs) – more secure than containers, which are less isolated. One particularly valuable security feature is that Firecracker is statically linked, which means all the libraries it needs to run are included in its executable code. This makes new Firecracker environments safer by eliminating outside libraries. Altogether, this offering and the combination of efficiency, security and speed created quite the buzz at the AWS re:Invent launch.
Will Firecracker make a “bang”?
There are a few caveats related to the still novel aspects of the technology. In particular, compared to alternatives, such as containers or Hyper-V VMs, it is prudent to confine to non-production workloads as the technology is still new and needs to be more fully battle-tested for production use.
However, as confidence, adoption, and experience grow in the use of serverless technologies it certainly seems like Firecracker can offer a popular new method for provisioning compute resources and will likely help bridge the current gap between VMs and containers.
Once or twice a year we like to take a look at what is going on in the world of reserved instance pricing. We review both the latest offerings and options put out by cloud providers, as well as how users are choosing to use Reserved Instances (AWS), Reserved VMs (Azure) and Committed Use (Google Cloud).
A good place to start when it comes to usage patterns and trends is the annual Rightscale (Flexera) State of Cloud Report. The 2019 report shows that current reservation usage stands at 47% for AWS, 23% for Azure and 10 percent of GCP. These are some interesting data when you view them alongside companies overall reporting that their number one cloud initiative for the coming year is optimizing their existing use of the cloud. All of these cloud providers have a major focus on pre-selling infrastructure via their reservations programs as this provides them with predictable revenue (something much loved by Wall St) plus also allows them to plan for and match supply with demand. In return for an upfront commitment they offer discounts of‘up to 80%”, albeit much as your local furniture retailer has big saving headlines, these discount levels still warrant further investigation.
While working on an upcoming new feature release we began to dig a little deeper into the nature of current reserved instance pricing and discounts. From our research it appears that a real world discount level is in the 30%-50% range. To achieve some of the much higher level discounts you might see the cloud providers pushing, typically requires commitments of three years; being restricted to only certain regions; restrictions on OS types; and generally a willingness to commit to spending a few million dollars.
Reservation discounts, while not as volatile as spot instances, do change and need to be carefully monitored and analyzed. For example as of this writing, one of the more popular modern m5.large instance types in a US East Region costs $0.096 per hour when purchased on demand, but reduces to $0.037, a significant 62% saving. However, to secure such a discount requires a three-year commitment and prepayment in full up front. While the numbers of such organizations committing to contracts of this nature is not publicly known, it is likely that only the most confident of organizations with large cash reserves would be positioned to make a play like this.
Depending on the precise program used to purchase the reservations, there can be certain options to either convert specific instance families, instance types and OS’s for other types or even to resell the instances on a secondary exchange for a penalty fee of 12%, on AWS for example. Or to terminate the agreement for the same 12% fee on Azure. GCP’s Committed Use program seems to be the most stringent as there is no way to cancel the contract or resell pre-purchased instances, albeit Google does not offer a pre-purchase option.
As the challenge of optimizing cloud spend has slowly moved up the priority list to take the #1 slot, so has a maturation process taken place inside organizations when it comes to undertaking economic analysis and understanding the various tradeoffs. Some organizations are using tools to support such analysis, others are hiring consultants or using in house analytics resources. Whatever the approach in terms of analyzing an organization’s use of cloud, this typically requires looking at balancing the purchase of different types of reservations, spot instances or using on-demand infrastructure that is highly optimized through automation tools. Whatever the approach, the level of complexity in such analysis is certainly not reducing, and mistakes are common. However, the potential savings are significant if you achieve the right balance and is clearly something you should not ignore.
The relative balance between the different options to purchase and consume cloud services in many ways reflects the overall context within which organizations operate, their specific business models and broader macro issues such as the outlook for the overall economy. Understanding the breadth of options is key and although for most organizations, reservations are likely to be a key component it is worth digging into just how large the relative trade offs might be.
As container technology moves past something new and into the mainstream, users are concerned about the next step: container optimization. In our conversations with customers and potential customers, containers have been a consistent topic for the last few years, typically focused on production environments. However, recent conversations have become more focused, specifically on how to optimize container spending.
Kubernetes – which seems to be the most popular of container services among our customer base – does allow for a number of ways to optimize for costs and to maximize performance. We have identified five specific opportunities ripe for container optimization. Take a look at these within your own environments.
1) Rightsize Your Pods
Kubernetes Pods are the smallest deployable computing units in the Kubernetes container environment. It is a common practice to use a standard template for limits and requests for pod provisioning. If requests describe the minimal requirement for the CPU and memory for a pod to be scheduled on a node, the limits describe the max amount of CPU and memory the pod can consume on that specific node. Typically engineers set the initial limits by using a rule of thumb, such as doubling it just to be on the safe side and then planning to change it later once they have some data to look at. As with many things in life, “later” rarely happens. As a result, the footprint of the cluster inflates over time, exceeding the actual demand for the services running inside the cluster.
Just think about it, if every pod is over-provisioned by 50% and the cluster is always is 80% full, that means that 40% of the cluster capacity is allocated but not used, or simply put — wasted.
2) Turn Off Idle Pods
Many standard instances/VMs and databases in non-production environments are idle outside of working hours and can be turned off or “parked”. The same case exists for Pods, which in non-production environments can and should be scheduled in the same way.
3) Rightsize Your Nodes
Too many worker nodes are the wrong size and type. Kubernetes permits co-allocating the applications on the same nodes, which can dramatically reduce the cloud bill. Yet, incorrectly sized instances and volumes can lead to the inflation of the cost of Kubernetes clusters. Rightsizing could save up to 50% (particularly if no previous action has been taken to rightsize your nodes.)
Another thing to consider is that smaller nodes have a higher relative OS footprint and increase management overhead. The smaller the node, the higher the number of stranded resources. Stranded resources are CPU or memory which are idle, yet cannot be allocated to any of the pods, because the pods which are to be scheduled are too big to claim it. If a pod’s sizes are close to the size of the node (server) the percentage of the resources which are stranded gets higher.
4) Consider Storage Opportunities
Out of the box, containers lose their data when they restart. This is fine for stateless components but becomes an issue when a persistent data store is required. One place to look for additional container optimization opportunities is the overprovisioning of persistent storage (EBS, Azure Storage Disks, etc) related to your containers. There are a number of options to optimize container storage, particularly virtualized storage that can be shared by multiple containers, and which persists over time, without being destroyed when individual containers are destroyed. There are a few different persistent-storage plugins and plugin-driven storage solutions available from third-party vendors.
5) Review Purchasing Options
All of the preceding options related to the actual configuration of your container infrastructure. Just as important as this is ensuring that your purchasing options closely align with your needs. Ensuring the correct instance/VM purchase type for your containerized infrastructure is critical to ensuring flexibility and maximizing ROI. Carefully analyze your purchasing options (e.g. on-demand, reservations and spot) to select the right option for your workload, both in terms of size and usage schedule. Note that reserved instances are not always the best option for resources that can be scheduled to be turned off. Leverage cost optimization tools to support the earlier options for instance scheduling and rightsizing. Such tools can often change the equation and help avoid lock-ins and upfront commitments.
Container Optimization is Just Another Kind of Resource Optimization
The opportunities to save money through container optimization are in essence no different than for your non-containerized resources. Native tools, from either the cloud provider or open source, can help with this, but their capabilities are limited. For a fully optimized environment, you’ll want to take advantage of the growing ecosystem of specific cost optimization tools.
Stay tuned for news from ParkMyCloud on this front coming soon!
Today we’re going to look at an interesting trend we are seeing toward the use of custom machine types in Google Cloud Platform. One of the interesting byproducts of managing the ParkMyCloud platform is that we get to see changes and trends in cloud usage in real time. Since we’re directly at the customer level, we can often see these changes before they are spotted by official cloud industry commentators. They start off as small signals in the noise, but practice has allowed us to see when something is shifting and a trend is emerging – as is the case with these custom machine types.
Over the last year, the shift to greater use of Custom Machine Types (launched in 2016 on Google Compute Engine) and to a lesser extent Optimized EC2 instances (launched in 2018 on AWS) are just such a signal that we have observed growing in strength. Interestingly, Microsoft has yet to offer their equivalent version on Azure.
What do GCE custom machine types let you do?
Custom machine types let you build a bespoke instance to match the specific needs of your workload. Many workloads can be matched to off-the-shelf instance types, but there are many workloads for which it is now possible to build a better matching machine which delivers a more cost effective price. The growth in adoption of this particular instance type supports the case and likely benefits of their availability.
So what are the benefits of these new customized machines? First, they provide a granular level of control to match the needs of your specific application workloads. In practice, this leads to compromise as you select the closest instance type to your optimal configuration. Such compromises typically lead to over-provisioning, a situation we see across the board among our customer base. We analyzed usage of the instances in our platform this summer, and found that across all the instances in our platform, the average vCPU utilization was less than 10%!
Secondly, they allow you to finely tune your machine to maximize the cost effectiveness of your infrastructure. Google claims savings of up to 50% when utilizing their customized options compared to traditional predefined instances, which we believe to be a reasonable assessment as we see the standard instance types are often massively overprovisioned.
On GCE, the variables that you can configure include:
Quantity and type of vCPU’s;
Quantity and type of GPU;
Memory size (albeit there are some limits on the maximum per vCPU).
On AWS, customized options are currently more limited and include only the number and type of vCPU’s, and the options are focused on per-core licensed software problems, rather than cost optimization. It will be interesting if they follow Google and open up cost-based customization options in the coming months, and allow the effective unbundling of fixed off-the-shelf instance types.
Should you use custom machine types?
So just because customization is an option, is this something you should actually pursue? In fact, you will pay a small premium compared to the size of standard instances/VMs, albeit you can optimize for specific workloads, which oftentimes will mean an overall lower cost. To make such an assessment will require that you examine your applications resource use and performance requirements. Such determinations require that you carefully analyze your infrastructure utilization data. This quickly gets complex, although there are a number of platforms which can support thorough analytics and data visualization. Ideally, such analytics would be combined with the ability to recommend specific cost-effective customized instance configurations as well as automate their provisioning.
Watch this space for more news on custom machine types!