Machine Learning for Optimizing Cloud Server Spend
A few weeks ago, my colleague Dale Hanson and I presented a LinkedIn Live sponsored by our partner, Dataiku. In this presentation, we talked about how to use machine learning for optimizing cloud server spend. This blog is a transcript of that presentation. The original video can be found here.
A Traditional Approach to Configuring On-Premise Servers
As background, this is a project that we did for the investment banking arm of a large financial institution. They were already in the process of migrating physical on-premise servers to the cloud. Normally, when you provision physical servers, you estimate the peak usage for an application or database and then add a little more for growth and unforeseen demand. That’s a lot of guesswork, and ultimately the majority of these servers spend their time well below peak usage. Because the price is fixed for physical hardware, there’s not much you can do about it.
How Configuration Can Be Different for Cloud Servers
I don’t think I need to spend a lot of time extolling the virtues of cloud computing, but keep in mind that in the cloud you only pay for what you use, unlike physical servers, where you’re paying whether they’re making you money or sitting in IT’s closet. If you find that a cloud server is running well below capacity, it’s easy to adjust and scale that server down. Similarly, if you need more resources for a server, greater capacity is just a couple of clicks away. Platforms like Azure and AWS also allow you to scale dynamically, so you can set a schedule to increase or decrease capacity depending on needs throughout the day or week.
Balancing Performance and Cost: How to Determine the Right Schedule
The challenge for many organizations is determining the right schedule to manage costs while ensuring you always have enough capacity to run your business smoothly. Using a machine learning solution with Dataiku and Snowflake, we set out to predict the usage throughout the day, and automatically scale servers to that predicted usage.
We’re going to take a look at the data set for a single server that really illustrates this. Here’s the full data, stretching back a few years. In the image below, we’re looking at virtual CPU (vCPU) utilization, with data captured about every five minutes.
One thing you’ll note here is the missing data. The orange circles mark gaps, some of them large, where we have no data at all. Those gaps have to be imputed before we run the machine learning algorithm, and that’s part of the pipeline process using Dataiku. Another thing to note is that this technique will be applied to thousands of servers, and each server is going to have its own unique utilization. So a separate machine learning model will be trained on each individual server to create its predictions and schedules. Dataiku makes it easy to productionize this and apply it to each server.
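To make the imputation step a little more concrete, here is a minimal Python sketch. The presentation doesn’t say which imputation method the pipeline used, so linear interpolation is an assumption here, as are the function name `impute_linear` and the (timestamp, vCPU) tuple layout. It only illustrates filling a five-minute grid before training.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=5)  # sampling interval described in the post


def impute_linear(samples, step=STEP):
    """Fill gaps in (timestamp, vcpu) samples by linear interpolation.

    Assumption: `samples` is sorted by timestamp and aligned to `step`.
    Returns a new list with one entry per step from first to last sample.
    """
    if not samples:
        return []
    filled = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        filled.append((t0, v0))
        missing = int((t1 - t0) / step) - 1  # readings absent between t0 and t1
        for i in range(1, missing + 1):
            frac = i / (missing + 1)
            filled.append((t0 + i * step, v0 + frac * (v1 - v0)))
    filled.append(samples[-1])
    return filled


# Illustrative data only: two readings 15 minutes apart, so two are missing.
start = datetime(2021, 1, 1)
raw = [(start, 2.0), (start + 3 * STEP, 8.0)]
for ts, vcpu in impute_linear(raw):
    print(ts.time(), round(vcpu, 2))
```

A straight line is probably fine for short gaps. For the large gaps circled in the chart, a method that respects the daily pattern would likely be a better choice.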
Looking at the data, note the “Last Day” quadrant where you can see that this server is being underutilized most of the time. The peak usage here is at night, between 10:00 PM and midnight. During that time, you see a jump up from the usual 2 to 4 vCPUs, all the way up to 10 to 12 vCPUs, and then after a few hours, it scales back down to 2 to 4 vCPUs. I don’t know the details on what this server is used for, but my guess is likely batch processing, where this pattern of utilization is commonplace. The point is to realize that the client is paying for peak utilization. This server is configured with 16 vCPUs, yet it’s only using 10 to 12 vCPUs for two hours a day, and the rest of the time, it’s being way underutilized.
Building the Predictive Server Utilization Model
The solution we built was to predict the utilization. The model that we used was an STL (Seasonal and Trend decomposition using Loess) model. We tried a variety of other algorithms, and most of them did nearly equally well. We chose the STL model since it had the best accuracy of all of the tested models, and it also ran the fastest. Certain models can take much longer to train, and that matters because this is a time series application: you have to retrain the model each night to incorporate new data. Algorithms that take longer to run ultimately end up being more expensive for the client to maintain. Accuracy is also paramount. The more accurate your model is, the more money you can save with this solution.
Another interesting finding is that more data doesn’t necessarily improve accuracy. In fact, more data can be harmful because data can become stale very fast. Including more data also causes the model to take longer to train, which ultimately is more expensive. This solution is trained on the most recent seven days’ worth of data to get the most accurate result.
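The trailing-window idea can be sketched in a few lines of Python. To be clear, this is not STL and not the model from the project. It is a deliberately simple stand-in (the average of each time-of-day slot) that only shows how a nightly retrain can ignore everything older than seven days. The function name `seasonal_mean_forecast` and the data layout are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def seasonal_mean_forecast(samples, now, window_days=7):
    """Average the readings in each time-of-day slot over the trailing window.

    `samples` is a list of (timestamp, vcpu) pairs; `now` is the retrain time.
    Returns {(hour, minute): mean vcpu}, read as the forecast for the next day.
    NOTE: a simple baseline standing in for STL, not the production model.
    """
    cutoff = now - timedelta(days=window_days)
    by_slot = defaultdict(list)
    for ts, vcpu in samples:
        if cutoff <= ts < now:  # readings older than the window are dropped
            by_slot[(ts.hour, ts.minute)].append(vcpu)
    return {slot: mean(values) for slot, values in by_slot.items()}


# Illustrative data only: one 22:00 reading per day for ten days.
# The three oldest days have a different level and should be ignored.
history = [(datetime(2021, 1, day, 22, 0), 100 if day <= 3 else 10)
           for day in range(1, 11)]
forecast = seasonal_mean_forecast(history, now=datetime(2021, 1, 11))
print(forecast[(22, 0)])  # 10
```

Because the window is short, each nightly retrain touches only about 2,000 five-minute readings per server, which is part of why training stays cheap.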
Creating Your Optimized Schedule
We forecast utilization at 5-minute intervals for the following day, and that forecast is used to create a schedule. We take the max predicted vCPU within each schedule interval (30 minutes, in this case) and round it up to the next vCPU configuration offered by the cloud provider.
In this example, let’s look at the time between 10:30 am and 11:00 am, and we’ll say the max predicted vCPU is around 2.8 (orange line). The process will schedule 4 vCPUs (gray line) because 3 vCPUs isn’t a possible configuration. vCPU configurations vary across providers and also across different server types. Typically, the scale is by a factor of 2x, so you’ll usually see something like 1 vCPU, 2 vCPUs, and then 4, 8, 16, and so on. The process rounds up to the next highest option that’s available for that type of server.
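The round-up step described above can be sketched in Python as follows. The size ladder `VCPU_OPTIONS` and the function names are hypothetical; real options depend on the provider and the server type.

```python
# Hypothetical size ladder; real options vary by provider and server type.
VCPU_OPTIONS = [1, 2, 4, 8, 16]


def round_up_to_option(vcpus, options=VCPU_OPTIONS):
    """Return the smallest available configuration that covers `vcpus`."""
    for option in sorted(options):
        if option >= vcpus:
            return option
    return max(options)  # demand exceeds the largest size; cap at the top


def build_schedule(forecast, points_per_interval=6, options=VCPU_OPTIONS):
    """Turn 5-minute forecasts into one vCPU setting per schedule interval.

    Six 5-minute points make up one 30-minute interval.
    """
    schedule = []
    for start in range(0, len(forecast), points_per_interval):
        peak = max(forecast[start:start + points_per_interval])
        schedule.append(round_up_to_option(peak, options))
    return schedule


# Illustrative forecast only: a quiet half hour, then a busy one.
forecast = [2.1, 2.4, 2.8, 2.6, 2.2, 2.0,
            10.5, 11.2, 11.8, 11.0, 10.2, 9.9]
print(build_schedule(forecast))  # [4, 16]
```

With a power-of-two ladder, a predicted peak of 10 to 12 vCPUs still lands on 16, so the savings on a server like this one come from the quiet hours rather than from the peak.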
The savings here are immediately apparent. With this server’s original configuration without schedules, the client was paying for 16 vCPUs 100% of the time. By building a schedule, you only pay for what you’ve scheduled, which is frequently much lower than 16 vCPUs. Most of the time it’s 2 or 4 vCPUs. During peak usage at night, the scheduler scales the server up to meet the demand, and then back down once it’s not needed anymore. All of the area between the schedule line and 16 vCPUs is savings.
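As a rough back-of-the-envelope check, the area between the schedule and the 16 vCPU line can be computed directly. This Python sketch assumes cost is proportional to vCPU-hours, which real pricing may not follow exactly, and the example day below is made up to resemble the pattern described, not taken from the client’s data. The function names are illustrative.

```python
def vcpu_hours(schedule, interval_minutes=30):
    """Total vCPU-hours for a list of per-interval vCPU settings."""
    return sum(schedule) * interval_minutes / 60


def savings_fraction(schedule, baseline_vcpus=16, interval_minutes=30):
    """Share of vCPU-hours avoided compared with a fixed-size server.

    Assumption: cost scales linearly with vCPU-hours.
    """
    baseline = baseline_vcpus * len(schedule) * interval_minutes / 60
    return 1 - vcpu_hours(schedule, interval_minutes) / baseline


# Made-up day: 4 vCPUs for 22 hours, 16 vCPUs for the 2-hour nightly peak.
day = [4] * 44 + [16] * 4
print(vcpu_hours(day))        # 120.0 scheduled vs 384.0 at a fixed 16 vCPUs
print(savings_fraction(day))  # 0.6875
```

Under those assumptions, the made-up day uses about 69% fewer vCPU-hours than a server fixed at 16 vCPUs.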
The Risks and How to Avoid Them
There is a risk to creating schedules for your servers. When you provision your server to handle peak utilization without a schedule, there’s essentially a 0% chance that you’re going to need more vCPUs than you have, because your server has full resources available at all times. Once you create a schedule from a prediction or forecast, there is a chance that you’re going to need resources that aren’t currently available to the server.
There are several options to consider to mitigate this risk. The trade-off is always going to be savings versus risk of under-provisioning. A couple of methods include:
- Generate the schedule based on a confidence interval (80%, 95%, etc.)
  - This is going to greatly reduce risk, but also greatly reduce cost savings
- Change the schedule interval
  - Cloud providers can scale with virtually no downtime. You could schedule your servers to change every 30 seconds, every 30 minutes, and so on
  - Shorter intervals (e.g. 30 seconds) lead to greater savings, but increase risk of under-provisioning
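One way to read the confidence interval option is to schedule against an upper bound on the forecast instead of the forecast itself. The Python sketch below is an assumption about how that could be done, not the project’s method: it treats past forecast errors as roughly normal with a constant spread and uses a one-sided bound. The name `upper_bound` and the residual values are hypothetical.

```python
from statistics import NormalDist, stdev


def upper_bound(forecast, residuals, confidence=0.95):
    """Shift each forecast point up to a one-sided upper bound.

    `residuals` are past (actual - predicted) errors; at least two are needed.
    Assumption: errors are roughly normal with constant spread, which real
    utilization data may violate.
    """
    z = NormalDist().inv_cdf(confidence)
    margin = z * stdev(residuals)
    return [point + margin for point in forecast]


# Hypothetical past errors, in vCPUs.
errors = [-1.0, 1.0, -1.0, 1.0]
print(upper_bound([2.8], errors, confidence=0.80))
print(upper_bound([2.8], errors, confidence=0.95))
```

A higher confidence level pushes the bound up, which can tip an interval into the next server size. That is the trade-off from the list above: less risk of under-provisioning, and less savings.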
Another important consideration is to understand the purpose of your application. In this example, I don’t know what this server does but it probably runs batch processes during the night. What could happen if you need to exceed the currently scheduled resources? Consider these two possible responses:
- Cloud providers offer features that scale the server automatically once it reaches peak utilization. This is less desirable than a schedule, because it takes a bit of time to recognize that the server is at peak utilization and then scale accordingly, whereas a schedule anticipates utilization. Still, this feature acts as a safety net for incorrect predictions or unforeseeable changes in demand for a server.
- Instead of scaling, you may ask yourself if it’s such a bad thing if the server doesn’t have all the resources that it needs. In the example of nightly batch processes, the server is still going to complete its task correctly. It’s just going to take a little bit longer, which may be acceptable if it still finishes at a reasonable time.
You have to consider how business critical these applications are and then you can build in buffers based on what the servers are used for.
How We Built It
We used Dataiku to productionize the solution. The customer in this case study had thousands of servers, and each one needs its own model trained on that server’s historical data. While the process is largely the same for most of them, you still have to impute missing values and roll this out to all those servers. Dataiku makes that process incredibly easy: you get the log data from your servers, put it into a data warehouse that Dataiku can access, train the models, create the schedules, and then load those schedules back into Azure or AWS. Now you have dynamic schedules for your servers based on predicted usage.
If you’re interested in learning more about this project or other ways to drive value with machine learning in your organization, contact us to schedule a discussion.