U.R.P. or: How we reduced our annual Azure bill by $36,000!
With increased Cloud consumption, cost optimisation has become a concern for organisations moving to the Cloud. In this series, we talk about unique but tried and tested methods to reduce Cloud footprint. In this post, I will show you how to cut the number of Redis Caches you run on the Cloud by sharing a single cluster among multiple applications.
Too many Redis Caches 🤷♂️
A few months back, when we started taking a deep dive into our Azure Subscription to reduce our monthly Azure bill, we realised that Azure Caches for Redis were among the highest expenditure sources. We follow a typical Micro-service styled architecture, with each service using its own isolated Redis Cache for each environment. Furthermore, to satisfy stakeholder demands for BCDR, we had replicas of the caches in alternate regions.
Redis Cache can do wonders for your application performance, but it can also burn a hole in your pocket. If you check the Azure pricing calculator, a Premium Redis Cache with 2 instances costs ~$1,200 per month.
We went one level deeper, and our analysis revealed that we were not using most of our Caches to their full potential (in terms of both storage and load). For example, most of our applications use about 1GB of a P1 Redis Cache, which has a maximum capacity of 6GB.
The Idea 💡
With such a cost leak, we were naturally motivated to reduce the number of Redis Caches in our Subscription. That's when the idea struck us - what if we merged the existing Redis Caches into a single cluster and let all the applications use the same cache?
Well, although the idea sounded novel, there were some obvious gaping issues that we had to deal with in such a scenario:
Key Isolation: Imagine you have 2 micro-services, Order Service and Profile Service, using the same cache. Both deal with a Customer DTO (but with a different contract), and both need to cache Customer information for faster processing. In all probability, both services use the customer ID as the unique key for these cached objects. Well, that's where disaster strikes, with each service overwriting the other's cached objects.
Management Concerns: The Redis Console is a pretty popular tool among Developers. It allows us to quickly flush the cache, check statistics, search for patterns, and a bunch of other utilities. Running these admin commands on a shared cache is bound to have side effects. For example, one service flushes the cache, resulting in the cache getting flushed for all services.
Configuration Management: The shared cluster might need to be upgraded, affecting how applications connect to it (e.g. if you change the SSL port of the cache, then every connection string needs to be updated). When multiple applications share the same instance, rolling out such a configuration update to all of them at the same time can be a daunting task.
Key Isolation 🎹
The first challenge that we set about solving was providing proper virtual isolation between the services. We had to ensure that the keys operated by numerous applications do not interfere with each other. We tackled this by introducing a prefix-styled namespacing for each of the applications. In other words, each application would prefix all the keys with a unique 3-6 letter phrase; we call it the moniker. Each application would choose a unique moniker, and when interacting with the shared cluster, this moniker would be the prefix for all keys.
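To make the prefixing idea concrete, here is a minimal sketch in Python (the actual URP SDK is a .NET library, so treat this purely as an illustration). A plain dictionary stands in for the shared Redis cluster, and the class name, the ":" separator, and the example monikers "ORD" and "PRF" are all assumptions made for this example.

```python
class NamespacedCache:
    """Prefixes every key with the application's moniker before touching the shared store."""

    def __init__(self, store: dict, moniker: str, separator: str = ":"):
        # The 3-6 letter rule comes from the post; the separator is an assumption.
        if not (3 <= len(moniker) <= 6 and moniker.isalpha()):
            raise ValueError("moniker must be 3-6 letters")
        self._store = store
        self._prefix = moniker.upper() + separator

    def _namespaced(self, key: str) -> str:
        return self._prefix + key

    def set(self, key: str, value) -> None:
        self._store[self._namespaced(key)] = value

    def get(self, key: str):
        return self._store.get(self._namespaced(key))


shared = {}  # stand-in for the shared Redis cluster
orders = NamespacedCache(shared, "ORD")    # hypothetical moniker for Order Service
profiles = NamespacedCache(shared, "PRF")  # hypothetical moniker for Profile Service

# Both services cache a customer under the same business key without colliding.
orders.set("customer-42", {"open_orders": 3})
profiles.set("customer-42", {"name": "Ada"})
```

Each service keeps using its own business key ("customer-42"), while the shared store only ever sees the prefixed keys.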
The next challenge was to coordinate this namespacing across the applications. With 2 or 3 applications, it's easy to control the namespacing. However, this approach of mutual respect and communication fails when the framework scales to many applications. Another complaint we received was that, with this namespacing mandate, these applications needed code changes and a complete regression cycle. Plus, since a leak from one application could potentially affect all other applications in a centralised system, the regression testing would have to be conducted simultaneously by all partners (not at all scalable or manageable).
At this point, we realised that we needed to build a platform to control the influx of applications.
Enter URP 📢
With all the centralisation problems and the need to scale, we invested in creating an open-source platform to unify the Redis Caches. And we decided to name it:
Unified Redis Platform (aka URP)
URP is an open-source solution that enables multiple applications to safely share and develop on the same Azure Redis Cache Cluster without any loss of functionality, performance or autonomy.
This means you can onboard onto a shared cluster with minimal development and testing effort.
How we built it 🏗
The architecture is fairly simple, and there are 2 major components - the Service Layer and the client SDK. For a detailed description of the other components, see our official documentation.
The service layer is what provides the centralisation aspect of the system. It has the following responsibilities:
Handle application onboarding and maintain centralised app metadata configuration. The service layer is responsible for assigning a unique moniker to the onboarded applications.
Maintain connection information for the Redis Cluster, and supply the same to the client SDKs. Hence, the Redis connection details are abstracted from the onboarded applications.
Monitor usage statistics and other telemetry of the shared cluster.
If there are multiple Redis instances in the Shared Cluster (e.g. Redis caches in mirror regions for BCDR), the service layer decides which instance should be used by an application to minimise regional latency.
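The responsibilities above can be sketched roughly as follows. This is not the URP service layer itself, only a small in-memory Python model of two of its jobs: keeping monikers unique and handing out connection details. The class and method names, the placeholder connection strings, and the "prefer the instance in the caller's region, otherwise fall back to the first one" rule are assumptions for illustration.

```python
class ServiceLayer:
    """In-memory model of moniker registration and instance selection."""

    def __init__(self, instances: dict):
        # region name -> connection details for the Redis instance in that region
        self._instances = dict(instances)
        self._owners = {}  # moniker -> application name

    def onboard(self, app_name: str, requested_moniker: str) -> str:
        """Register an application and guarantee its moniker is unique."""
        moniker = requested_moniker.upper()
        owner = self._owners.get(moniker)
        if owner is not None and owner != app_name:
            raise ValueError(f"moniker {moniker!r} is already used by {owner!r}")
        self._owners[moniker] = app_name
        return moniker

    def connection_for(self, moniker: str, app_region: str) -> str:
        """Return connection details so the application never stores them itself."""
        if moniker not in self._owners:
            raise KeyError(f"unknown moniker {moniker!r}")
        # Assumed policy: same-region instance first, otherwise the first configured one.
        if app_region in self._instances:
            return self._instances[app_region]
        return next(iter(self._instances.values()))


service = ServiceLayer({"eastus": "east-connection", "westus": "west-connection"})
order_moniker = service.onboard("Order Service", "ord")
```

Because applications ask the service layer for connection details, a change to the cluster only has to be made in one place.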
The Client SDK provides all the necessary abstractions to the application to reduce development effort. It's a .NET Standard library that has to be installed by the applications. Its main responsibilities include:
Wrap all Redis operations and handle key namespacing, so that application business layers are not affected by the namespace mandate.
Interact with the Service Layer to get the connection details and app metadata. Responsible for all interactions with the Redis Cluster.
It provides additional utilities such as Retry in case of failures, timeouts based on app metadata and capturing usage & performance telemetry.
URP SDK also provides the mechanism to maintain data consistency in scenarios where the shared cluster contains Redis instances in multiple regions. Read more.
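As an illustration of the retry utility mentioned above, here is a small Python sketch. It is not the SDK's implementation; the function name, the default attempt count, and the choice of which exceptions count as transient are assumptions. In URP these settings would come from the app metadata.

```python
import time


def with_retry(operation, max_attempts=3, delay_seconds=0.0,
               retry_on=(ConnectionError, TimeoutError)):
    """Run operation(), retrying on transient errors up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on as error:
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise last_error


calls = []


def flaky_ping():
    """Simulated cache call that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("cache temporarily unreachable")
    return "PONG"


result = with_retry(flaky_ping)
```

The caller only sees the final result or the last error; the intermediate failures are absorbed by the wrapper.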
For more details about the architecture, see the official documentation.
Zero Development Disruption
Another obvious challenge in reducing development effort was to ensure parity between the pre-existing libraries used by the applications to interact with Redis Cache and the new URP SDK. Therefore, we surveyed all known applications integrated with Redis Cache, and we observed that they mainly use 2 libraries: StackExchange.Redis and the IDistributedCache abstraction in ASP.NET Core.
To ensure parity, we built the URP SDK on top of the StackExchange.Redis library. We introduced a few additional interfaces like IUnifiedConnectionMultiplexer and IUnifiedDatabase, but ensured that we implement all existing interfaces like IConnectionMultiplexer and IDatabase. Thus, only the connection mechanism needs to be changed; all other operations won't be affected since URP SDK wraps all other interfaces and methods.
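The "only the connection changes" idea can be shown with a rough Python analogy. The method names string_set and string_get loosely mirror StringSet and StringGet from StackExchange.Redis, but the classes below are hypothetical stand-ins backed by a dictionary, not the real IDatabase or IUnifiedDatabase.

```python
class PlainDatabase:
    """Stand-in for the interface the application already codes against."""

    def __init__(self, store: dict):
        self._store = store

    def string_set(self, key: str, value) -> bool:
        self._store[key] = value
        return True

    def string_get(self, key: str):
        return self._store.get(key)


class UnifiedDatabase(PlainDatabase):
    """Same methods and signatures, with the moniker applied transparently."""

    def __init__(self, store: dict, moniker: str):
        super().__init__(store)
        self._prefix = f"{moniker}:"  # separator is an assumption

    def string_set(self, key: str, value) -> bool:
        return super().string_set(self._prefix + key, value)

    def string_get(self, key: str):
        return super().string_get(self._prefix + key)


def cache_customer(db, customer_id: int, payload: dict):
    """Business code: written once, unaware of whether the cache is shared."""
    db.string_set(f"customer-{customer_id}", payload)
    return db.string_get(f"customer-{customer_id}")


store = {}
before = cache_customer(PlainDatabase(store), 42, {"name": "Ada"})
after = cache_customer(UnifiedDatabase(store, "PRF"), 42, {"name": "Ada"})
```

The business function is identical in both calls; only the object created at connection time differs.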
Similarly, for IDistributedCache in ASP.NET Core, we released an extension package (DistributedCache.Extensions.UnifiedRedisPlatform) which implements all methods of IDistributedCache. As with the core library, we also introduced an additional interface, IDistributedUnifiedCache, for performing URP-related utilities.
Please see our official documentation for more details.
Admin Isolation 👷♀️
The final critical piece of the puzzle was solving the admin management problems. As described earlier, we had to ensure that administration-related work on the shared cluster has the same level of isolation as application traffic.
We built a C# console application that the administrator can use to connect to the Shared Cluster. All utilities provided by the Management Console follow the URP standard in terms of key namespacing and connection protocols. Admins can safely use the Management Console for the following activities:
Flush application cache (without disrupting other applications' data)
Ping remote cluster to check availability
Delete single key
Scan all keys
Pattern-based key search
View cached value (only for String-based keys)
Create a string-based key
You can read more about it in our official documentation.
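To show what a namespace-scoped flush and pattern search could look like, here is a minimal Python sketch over a dictionary standing in for the cluster. It is not the Management Console's code; the class name, the ":" separator, and the example keys are assumptions. Against a real Redis instance the same idea would typically rely on SCAN with a MATCH pattern built from the moniker rather than a cluster-wide flush.

```python
from fnmatch import fnmatchcase


class ScopedAdminConsole:
    """Admin helpers that only ever touch keys under one application's moniker."""

    def __init__(self, store: dict, moniker: str, separator: str = ":"):
        self._store = store
        self._prefix = moniker + separator

    def scan(self, pattern: str = "*") -> list:
        """Return this application's keys (without the prefix) matching a glob pattern."""
        full_pattern = self._prefix + pattern
        return sorted(
            key[len(self._prefix):]
            for key in self._store
            if fnmatchcase(key, full_pattern)
        )

    def flush(self) -> int:
        """Delete only this application's keys and report how many were removed."""
        doomed = [key for key in self._store if key.startswith(self._prefix)]
        for key in doomed:
            del self._store[key]
        return len(doomed)


cluster = {
    "ORD:customer-1": "order view",
    "ORD:order-9": "pending",
    "PRF:customer-1": "profile view",
}
console = ScopedAdminConsole(cluster, "ORD")
matches = console.scan("customer-*")
removed = console.flush()
```

After the flush, the Profile Service key is still in the store, which is the isolation property the console is meant to provide.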
The Management Console is a work in progress, and we are looking at creating a GUI-based interface for ease of use.
Ground Report 📉
Internally, we have onboarded 6 applications to a Shared Cluster containing 2 P1 (6GB) Redis instances located in Azure East US and West US.
$36,000 per year reduction in cost
Handles 4.7 million operations per day
There has been no breach of the agreed performance SLA. The below data is for the last 7 days (with all figures in milliseconds).
Help us grow 💕
URP started as an experimental idea, and we have seen a solid overall reduction in cloud cost in our Organisation. We would love to hear the community's feedback on it.
If you liked this idea and would like to know more about it or want to be part of our community, please watch/star our GitHub repository.