Saving Money on Batch Workloads in Public Cloud

Large companies have traditionally had an impressive list of batch workloads that run at night, after people have gone home for the day. These include such things as application and database backup jobs; extract, transform, and load (ETL) jobs; disaster recovery (DR) environment checks and updates; online analytical processing (OLAP) jobs; and monthly/quarterly billing updates or financial “close”, to name a few.

Traditionally, with on-premise data centers, these workloads have run at night so that the same hardware infrastructure that supports daytime interactive workloads can be repurposed, if you will, for batch. This served a couple of purposes:

  • It avoided network contention between the two workloads (as both are important), allowing the interactive workloads to remain responsive.
  • It avoided data center sprawl by using the same infrastructure to run both, rather than having dedicated infrastructure for interactive and batch.

Things Are Different with Public Cloud

As companies move to the public cloud, they are no longer constrained by having to repurpose the same infrastructure. In fact, they can spin up and spin down new resources on demand in AWS, Azure or Google Cloud Platform (GCP), running both interactive and batch workloads whenever they want.

Network contention is also less of a concern, since the public cloud providers typically have plenty of bandwidth. The exception, of course, is where batch workloads use the same application interfaces or APIs as the interactive workloads to read/write data.

So, moving to public cloud offers a spectrum of possibilities, and you can use one or any combination of them:

  • You can run batch nightly using similar processes to those you use in your existing data centers, but on separately provisioned instances/virtual machines. This probably requires the least effort to move batch to the public cloud and the least change to your DevOps processes, and it may save you some money by having instances sized specifically for the workloads and by leveraging cloud cost savings options (e.g., reserved instances);
  • You can run batch on separately provisioned instances/virtual machines, but concurrently with existing interactive workloads. This will likely require some additional work to change your DevOps processes, but it offers more freedom and benefits similar to those mentioned above. You will still need to pay attention to any application interfaces/APIs the workloads have in common; or
  • At the extreme end of the cloud adoption spectrum, you could use cloud provider platform as a service (PaaS) offerings, such as AWS Batch, Microsoft Azure Batch or GCP Cloud Dataflow, where batch is essentially treated as a “black box”. A detailed comparison of these services is beyond the scope of this blog. In summary, though, these are fully managed services: you queue up input data in an S3 bucket, object blob or volume along with a job definition, appropriate environment variables and a schedule, and you’re off to the races. These services employ containers and autoscaling/resource groups/instance groups where appropriate, with options to use less expensive compute in some cases. (For example, with AWS Batch, you have the option of using spot instances.)

The advantage of this approach is potentially faster time to implement and (maybe) lower monthly cloud costs, because the compute services run only at the times you specify. The disadvantages may be the limited degree of operational/configuration control you have; the fact that these services may be totally foreign to your existing DevOps folks/processes (i.e., there is a steep learning curve); and that they may tie you to that specific cloud provider.
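
To make the PaaS model a bit more concrete, here is a minimal sketch of what submitting a job to AWS Batch with boto3 might look like. The queue, job definition and environment variable names are hypothetical placeholders; in practice you would create the compute environment, job queue and job definition ahead of time.

    # A minimal sketch of submitting a job to AWS Batch with boto3.
    # All names here are hypothetical placeholders; the compute environment,
    # job queue and job definition must already exist.
    import boto3

    batch = boto3.client("batch", region_name="us-east-1")

    response = batch.submit_job(
        jobName="nightly-etl",                 # hypothetical job name
        jobQueue="nightly-batch-queue",        # hypothetical queue
        jobDefinition="etl-job-def:1",         # hypothetical job definition (name:revision)
        containerOverrides={
            "environment": [
                {"name": "INPUT_PREFIX", "value": "s3://example-bucket/2017-05-09/"},
            ]
        },
    )
    print("Submitted job:", response["jobId"])

Azure Batch and Cloud Dataflow follow the same general pattern: define the work once, point it at your input data, and let the service manage the underlying compute.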

A Simple Alternative

If you are looking to minimize impact to your DevOps processes (that is, the first two approaches mentioned above), but still save money, then ParkMyCloud can help.

Normally, with the first two options, there are cron jobs scheduled to kick off batch jobs at the appropriate times throughout the day, but the underlying instances must be running for cron to do its thing. You could use ParkMyCloud to put parking schedules on these resources, such that they are turned OFF for most of the day but are turned ON just in time for the cron jobs to execute.

We have been successfully using this approach in our own infrastructure for some time now, to control a batch server used for database backups. It can, in fact, provide more savings than AWS Reserved Instances.

Let’s look at a specific example in AWS. Suppose you have an m4.large server you use to run batch jobs. Assuming Linux pricing in us-east-1, this server costs $0.10 per hour, or about $73 per month. Suppose you have configured cron to start batch jobs at midnight UTC and that they normally complete 1 to 1-½ hours later.

You could purchase a Reserved Instance for that server, where you pay either nothing upfront or all upfront, and your savings would be 38%-42%.

Or, you could put a ParkMyCloud schedule on the instance so that it is only ON from 11 p.m. to 1 a.m. UTC, allowing enough time for the cron jobs to start and run. The savings in that case would be 87.6% (including the cost of ParkMyCloud), without the need for a one-year commitment. Depending on how many batch servers you run in your environment and their sizes, that could be some hefty savings.
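
If you want to sanity-check that math yourself, here is the back-of-the-envelope version in Python, using the on-demand rate quoted above and assuming a 730-hour month. The ParkMyCloud subscription fee is not modeled here, which is why the raw number comes out a bit higher than the quoted 87.6%.

    # Rough savings math for the m4.large batch server example above.
    rate_per_hour = 0.10        # m4.large, Linux, us-east-1 on-demand
    hours_per_month = 730       # the usual "about 730 hours" month

    always_on = rate_per_hour * hours_per_month            # ~$73/month
    parked = rate_per_hour * 2 * (hours_per_month / 24)    # ON 2 hours/day (11 p.m.-1 a.m.)

    print(f"Always on: ${always_on:.2f}/month")
    print(f"Parked:    ${parked:.2f}/month")
    print(f"Savings:   {1 - parked / always_on:.1%} before the ParkMyCloud fee")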

Conclusion

Public cloud will offer you a lot of freedom and some potentially attractive cost savings as you move batch workloads from on premise. You are no longer constrained by having the same infrastructure serve two vastly different types of workloads — interactive and batch. The savings you can achieve by moving to public cloud can vary, depending on the approach you take and the provider/service you use.

The approach you take depends on the amount of change you’re willing to absorb in your DevOps processes. If you are willing to throw caution to the wind, the cloud provider PaaS offerings for batch can be quite compelling.

If you wish to take a more cautious approach, we engineered ParkMyCloud to park servers without the need for scripting or for you to be a DevOps expert. This approach lets you achieve decent savings with minimal change to your DevOps batch processes and without the need for Reserved Instances.

Start and Stop RDS Instances on AWS – and Schedule with ParkMyCloud

Amazon Web Services shared today that users can now start and stop RDS instances – check out the full announcement on their blog.

This is good news for cost-conscious engineering teams. Until now, databases were generally left running 24×7, even if they were only used during working hours for testing and staging purposes. Now, they can be turned off, so you’re not charged for time you’re not using. Nice!

Keep in mind that stopping the RDS instances will not bring the cost to zero – you will still be charged for provisioned storage, manual snapshots and automated backup storage.
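
If you want to try the new capability by hand before scheduled parking arrives, a minimal boto3 sketch looks something like this; the instance identifier and region are placeholders, and error handling is omitted.

    # Stop an RDS instance at the end of the day and start it the next morning.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Evening: stop the database (storage and backup charges still apply).
    rds.stop_db_instance(DBInstanceIdentifier="staging-db")   # placeholder identifier

    # Morning: start it back up before the team needs it.
    rds.start_db_instance(DBInstanceIdentifier="staging-db")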

Now, what if you want to start and stop RDS instances on an automated schedule to ensure they’re not left running when they’re not needed? Coming soon, you’ll be able to with ParkMyCloud!

Start and Stop RDS Instances on a Schedule with ParkMyCloud

Since ParkMyCloud was first released, customers have been asking us for the ability to park their RDS instances in the same way that they can park EC2 instances and auto scaling groups.

The logic to start/stop RDS instances using schedules is already in the production code for ParkMyCloud. We have been patiently waiting for AWS to officially announce this capability, so that we could turn the feature ON and release it to the public. That day is finally here!

Our development team has some final end-to-end testing to complete, just to make sure everything works as expected. Expect RDS parking to be released within a couple of weeks! Let us know if you’d like to be notified when this is released, or if you’re interested in beta testing the new functionality.

We’re excited about this opportunity to give ParkMyCloud users what they’re asking for. What else would you like to see for optimal cost control? Comment below to let us know.

Cutting through the AWS and Azure Cloud Pricing Confusion (Caveat Emptor)

Before I try to break down the AWS and Azure cloud pricing jargon, let me give you some context. I am a crusty, old CTO who has been working in advanced technology since the 1980s. (That’s more than 18 Moore’s Law cycles for processor and chipset fans, and I have lost count of how many technology hype cycles that has been.)

I have grown accustomed to the “deal of a lifetime” on the “technology of the decade” coming around about once every week. So, you can believe me when I tell you I have a very low BS threshold for dishonest sales folks and bogus technology claims. Yes, I am jaded.

My latest venture is a platform, ParkMyCloud, that brings together multiple public cloud providers. And I can tell you first hand that it is not for the faint-of-heart. It’s like being dropped off in the middle of the jungle in Papua New Guinea. Each cloud provider has its own culture, its own philosophy, its own language and customers, its own maturity level and, worst of all — its own pricing strategy — which makes it tough for buyers to manage costs. I am convinced that the lowest circles of hell are reserved for people who develop cloud service pricing models. AWS and Azure cloud pricing gurus, beware. And reader, to you: caveat emptor.

AWS and Azure Terminology Differences

Case in point: You have probably read the comparisons of various services across the top cloud providers, as people try to wrap their minds around all the varying jargon used to describe pretty much the same thing. For example, let’s just look at one service: Cloud Computing.

In AWS, servers are called Elastic Compute Cloud (EC2) “Instances”. In Azure, they are called “Virtual Machines” or “VMs”. Flocks of these, spun up from an image according to scaling rules, are called “auto scaling groups” in AWS. The same things are called “scale sets” in Azure.

Of course, cloud providers had to start somewhere; then they learned from their mistakes and improved. When AWS started with EC2, they had not yet released virtual private clouds (VPCs), so their instances ran outside of VPCs. Now all the latest stuff runs inside of VPCs. The older ones are called “classic” and have a number of limitations.

The same thing is true of Azure. When they first released, their VMs were not set up to use what is now their Resource Manager or to be managed in Resource Groups (the moral equivalent of CloudFormation Stacks in AWS). Now, all of their latest VMs are compatible with Resource Manager. The older ones are called, you guessed it … “classic”.

(What genius came up with the idea to call the older versions of these, the ones you’re probably stranded with and no longer want, “classic”?)

Both AWS and Azure have a dizzying array of instances/VMs to choose from, and doing an apples-to-apples comparison between them can be quite daunting. They have different categories: general purpose, compute optimized, memory optimized, storage optimized, etc.

Then within each one of those, there are types or sizes. For example, in AWS the tiny, cheap ones are currently the “t2” family. In Azure, they are the “A” series. On top of that, there are different generations of processors. In AWS, an integer after the family letter indicates the generation, like t2, m3, m4, and there are sizes: t2.small, m3.medium, m4.large, r16.ginormus (OK, I made that one up).

In Azure, a number after the family letter connotes size, like A0, A1, A2, D1, etc., with “v1” or “v2” after that to tell which generation it is, like D1v1, D2v2.

The bottom line: this is very confusing for folks moving their workloads to public cloud from on-premise data centers (yet another Wonderland of jargon and confusion in its own right). How does one decide which cloud provider to use? How does one even begin to compare prices with all of this mess? Cheer up … it gets worse!

AWS and Azure Cloud Pricing – Examining Differences in Charging

To add to that confusion, they charge you differently for the compute time you use. What do I mean? AWS prices its compute time by the hour. And by hour, they mean any fraction of an hour: if you start an instance, run it for 61 minutes and then shut it down, you get charged for 2 hours of compute time.

Microsoft Azure cloud pricing is listed by the hour for each VM, but they charge you by the minute. So, if you run for 61 minutes, you get charged for 61 minutes. On the surface, this sounds very appealing (and makes me want to wag my finger at AWS and say, “shame on you, AWS”).

However, you really have to pay attention to the use case and the comparable instance prices. Let me give you a concrete example. I mentioned my latest venture, ParkMyCloud, earlier. We park (that is, schedule on/off times for) cloud computing resources in non-production environments (without scripting, by the way). So, here is a graph of six months’ worth of data from an m4.large instance somewhere in Asia Pac. The m4 family is based on the Xeon Broadwell or Haswell processor, and it is one of the most commonly used instance types.

This instance is on a ParkMyCloud parking schedule, where it is RUNNING from 8:00 a.m. to 7:00 p.m. on weekdays and PARKED evenings and weekends. This instance, assuming Linux pricing, costs $0.125 per hour in AWS. From November 6, 2016 until May 9, 2017, this instance ran for 111,690 minutes. This is actually about 1,862 hours, but AWS charged for 1,922 hours and it cost $240.25 in compute time.

(Figure: example of instance uptime in minutes per day.)

Why the difference? ParkMyCloud has a very fast and accurate orchestration engine, but when you start and stop instances, the cloud provider and network response can vary from hour to hour and day to day, depending on their load, so occasionally things will run that extra minute. And, even though this instance is on a parking schedule, when you look at the graph, you can see that the user took manual control a few times. Stuff happens!

What would the cost have been if AWS charged the same way as Azure? It would have cost only $232.69. Well, that’s not too bad over the course of six months, unless you have 1,000 of these. Then it becomes material.
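
For those who like to check the arithmetic, here is a rough sketch of the two billing models using the figures above. The 1,922 billed hours are taken as given, since they depend on how each individual start/stop session rounded up.

    # Per-minute billing vs. AWS per-hour rounding, using the example's figures.
    import math

    rate = 0.125                 # m4.large, Linux, this Asia Pac region ($/hour)
    minutes_run = 111_690        # actual runtime over the six months

    # Billed by the minute (the Azure-style model):
    print(f"Per-minute billing: ${minutes_run * rate / 60:.2f}")    # ~$232.69

    # AWS rounds each start/stop session up to a whole hour; a single
    # 61-minute run bills as 2 hours:
    print(f"A 61-minute run bills as {math.ceil(61 / 60)} hours")

    # Summed over roughly 130 weekday runs (plus a few manual starts), that
    # per-session rounding turned ~1,862 hours of runtime into 1,922 billed hours:
    print(f"As billed by AWS:   ${1922 * rate:.2f}")                # $240.25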

However, I wouldn’t rush to judgment on AWS. If you look at the comparable Azure VM, the standard-tier DS2 v2, also running Linux, it costs $0.152/hour. So, this same instance running in Azure would have cost $290.39. Yikes!

Therefore, in my particular use case, unless Azure cloud pricing drops to make their CPU pricing more competitive, their per-minute billing really doesn’t save money.

Conclusion

The ironic thing about all of this is that once you get past all the confusing jargon and the ridiculous approaches to pricing and charging for usage, the actual cloud services themselves are much easier to use than legacy on-premise services. The public cloud services do provide much better flexibility and faster time-to-value. The cloud providers simply need to get out of their own way. Pricing is but one example where AWS and Azure need to make things a lot simpler, so that newcomers can make informed decisions.

From a pricing standpoint, AWS on-demand pricing is still more competitive than Azure cloud pricing for comparable compute engines, despite Azure’s more enlightened approach to charging for CPU time. That said, AWS really needs to get in line with both Azure and Google, who charge by the minute. Nobody likes being charged extra for something they don’t use.

In the meantime, ParkMyCloud will continue to help you turn off non-production cloud resources when you don’t need them, and help save you a lot of money on your monthly cloud bills. If we make anything sound more complex than it needs to be, call us out. No hiding behind jargon here.

The Cloud Waste Problem That’s Killing Your Business (and What To Do About It)

Waste not, want not. That was one of the well-worn quips of one of the United States’ Founding Fathers, Benjamin Franklin. It couldn’t be more timely advice in today’s cloud computing world – the world of cloud waste. (When he was experimenting with static electricity and lightning, I wonder if he saw the future of Cloud? :^) )

Organizations are moving to the Cloud in droves. And why not? The shift from CapEx to monthly OpEx, the elasticity, the reduced deployment times and faster time-to-market: what’s not to love?

The good news: the public cloud providers have made it easy to deploy their services. The bad news: the public cloud providers have made it easy to deploy their services…really easy.  

And experience over the past decade has shown that this ease leads to cloud waste. What is “cloud waste” and where does it come from? What are the consequences? What can you do to reduce it?

What is Cloud Waste?

“Cloud waste” occurs when you consume more cloud resources than you actually need to run your business.

It takes several forms:

  • Resources left running 24×7 in development, test, demo, and training environments where they don’t need to be running 24×7. (Think of parents yelling at children to “turn the lights out” if they are the last one out of a room.) I believe this is a bad habit reinforced by the previous era of on-premise data centers. The thinking: it’s a sunk cost anyway, so why bother turning it off? Of course, it’s not a sunk cost anymore.

This manifests itself in various ways:

    • Instances or VMs which are left running, chewing up $/CPU-Hr costs and network charges
    • Orphaned volumes (volumes not attached to any servers), which are not being used and incurring monthly $/GB charges
    • Old snapshots of those or other volumes
    • Old, out-of-date machine images

However, cloud consumers are not the only ones to blame. The public cloud providers are also responsible when it comes to their PaaS (platform as a service) offerings for which there is no OFF switch (e.g., AWS’ RDS, Redshift, DynamoDB and others). If you deliver a PaaS offering, make sure it has an OFF switch.

  • Resources that are larger than needed to do the job. Many developers don’t know what size instance to spin up to do their development work, so they will often spin up larger ones. (Hey, if 1 core and 4 GB of RAM is good, then 16 cores and 64 GB of RAM must be even better, right?) I think this habit also arose in the previous era of on-premise data centers: “We already paid for all this capacity anyway, so why not use it?” (Wrong again.)

This, too, rears its ugly head in several ways:

    • Instances or VMs which are much larger than they need to be
    • Block volumes which are larger than they need to be
    • Databases which are way over-provisioned compared to their actual IOPS or sequential throughput requirements.

Who is Affected by Cloud Waste?

The consequences of cloud waste are quite apparent. It is killing everyone’s business bottom line. For consumers, it erodes their return on assets, return on equity and net revenue.  All of these ultimately impact earnings per share for their investors as well.

Believe it or not, it also hurts the public cloud providers and their bottom line. Public cloud providers are most profitable when they can oversubscribe their data centers. Cloud waste forces them to build more very expensive data centers than they need, hurting their oversubscription rates and their profitability as well. This is why you see cloud providers offering certain types of cost-cutting options. For example, AWS offers Reserved Instances, where you pay up front for a break on on-demand pricing. They also offer Spot Instances, Auto Scaling Groups and Lambda. Azure also offers price breaks to their ELA customers, along with Scale Sets (the equivalent of ASGs).

How to Prevent Cloud Waste

So, what can you do to address this? Ultimately, the solution to this problem exists between your ears. Most of it is common sense: It requires rethinking… rewiring your brain to look at cloud computing in a different way. We all need to become honorary Scotsmen (short arms and deep pockets… with apologies to my Scottish friends).

  • When you turn on resources in non-production environments, turn on the minimum size needed to get the job done and only grudgingly move up to the next size.
  • Turn stuff off in non-production environments, when you are not using it. And for Pete’s sake, when it comes to compute time, don’t waste your time and money writing your own scripts…that just exacerbates the waste. Those DevOps people should spend that time on your bread and butter applications. Use ParkMyCloud instead! (Okay, yes, that was a shameless plug, but it is true.)
  • Clean up old volumes, snapshots and machine images (a rough sketch of how to spot them follows this list).
  • Buy Reserved Instances for your production environments, but make sure you manage them closely, so that they actually match what your users are provisioning; otherwise you could be double-paying.
  • Investigate Spot fleets for your production batch workloads that run at night. It could save you a bundle.
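
As a starting point for the cleanup item above, here is a rough boto3 sketch for spotting two common forms of waste: unattached (“orphaned”) EBS volumes and snapshots older than 90 days. It only reports what it finds; the 90-day cutoff and region are assumptions, and anything you actually delete should go through your own review first.

    # Report orphaned EBS volumes and old snapshots (pagination omitted).
    from datetime import datetime, timedelta, timezone
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Volumes with status "available" are not attached to any instance.
    for vol in ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]:
        print(f"Orphaned volume {vol['VolumeId']}: {vol['Size']} GB")

    # Snapshots you own that are older than the (assumed) 90-day cutoff.
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    for snap in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]:
        if snap["StartTime"] < cutoff:
            print(f"Old snapshot {snap['SnapshotId']} from {snap['StartTime']:%Y-%m-%d}")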

These good habits, over time, can benefit everyone economically: Cloud consumers and cloud producers alike.  

Tag Your Cloud Resources to Get Organized and Get Automated

ParkMyCloud has built a SaaS platform that allows users to turn off AWS EC2 resources in non-production environments during off-hours. We call this “parking”.

In order for ParkMyCloud to automatically recommend instances to be parked – or to simply automate the parking process altogether – users should have a naming convention for their instances or some type of tag on the resources.

Of course, our application is not the only one which relies on you having some type of tagging strategy for your resources. In fact, I’ll wager that most cloud applications involved in some type of cloud analytics, DevOps automation or cloud management would also be in a better position to help you help yourself if you had a consistent approach to tagging.

Tagging can help you in all sorts of areas, from asset management, to automation, cost allocation and security.

While I was researching ideas for this blog, I came across a great resource on AWS Tagging Strategies, published by AWS in January of this year, which provides a comprehensive overview on the subject. In fact, I couldn’t have said it better.  You can find it here.

One of the excerpts from this blog post that caught my eye was the one about tagging categories:

Organizations that are most effective in their use of tags typically create business-relevant tag groupings to organize their resources along technical, business, and security dimensions. Organizations that use automated processes to manage their infrastructure also include additional, automation-specific tags to aid in their automation efforts.

We take this concept to heart here at ParkMyCloud in our internal systems by applying tags based on functional group (prod, dev, test, QA, etc.) and resource type. We also recommend this approach to our customers — many of whom do use category tagging to identify resources as “production” or “dev”, “test”, “QA”, and “staging”, for example. This makes it very easy for us to identify and automate the process of “parking” instances and quickly save customers money.
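
As a simple illustration of that kind of functional-group tagging, here is a minimal boto3 sketch that applies environment and team tags to an instance. The instance ID, tag keys and values are all placeholders for whatever convention you settle on.

    # Apply functional-group tags so downstream tooling can find this instance.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ec2.create_tags(
        Resources=["i-0123456789abcdef0"],          # placeholder instance ID
        Tags=[
            {"Key": "environment", "Value": "test"},
            {"Key": "team", "Value": "qa"},
        ],
    )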

In the coming weeks, we will release “Zero-Touch Parking”, which will take maximum advantage of tagging. Here’s how it will work.

Let’s say you have 100 instances, 50 of which are production instances and therefore should never be parked. If those 50 instances are tagged “production app”, you simply create a policy that says any instance with that tag is routed to the Production group, blocked from having a schedule assigned to it, and left running 24x7x365.

On the other hand, let’s say we have an instance tagged “test”. We could have a corresponding policy that assigns that instance to the group “Test Team” and applies a “Test” schedule, which turns the instance off at 6:00 p.m. every weekday, on at 8:00 a.m. every weekday, and off on weekends.
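
To make the idea concrete, here is an illustrative sketch of what such a tag-driven “Test” policy amounts to (not ParkMyCloud’s implementation; the whole point of the product is that you don’t have to write or maintain this): outside the 8:00 a.m. to 6:00 p.m. weekday window, find running instances tagged “test” and stop them. The tag key/value, region and schedule boundaries are assumptions.

    # Illustrative tag-driven parking logic for "test" instances.
    from datetime import datetime
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def park_test_instances(now=None):
        """Stop running 'test'-tagged instances outside weekday working hours."""
        now = now or datetime.now()
        if now.weekday() < 5 and 8 <= now.hour < 18:
            return  # weekday working hours: leave test instances running

        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag:environment", "Values": ["test"]},   # assumed tag key/value
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
        if instance_ids:
            ec2.stop_instances(InstanceIds=instance_ids)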

The strategies outlined in this helpful document are not limited to AWS, but should be applicable to environments in just about any cloud provider.

So, on behalf of other cloud applications, help us to help you — get out there and start tagging, if you haven’t already.