We've known for a while that basing cache selection solely on GeoIP is problematic. The most significant issues:
No way for the load on a cache to feed back into the selection process. This creates a thundering herd problem: a large number of clients knocks over a single cache, then the next cache on the list, then the next...
Doesn't take into account file locality and the cost of making a new copy from the origin. A nearby cache without the file will be preferred 100% of the time over a slightly more distant cache with the file.
Doesn't balance across multiple caches within a region.
I propose a new weighted round robin algorithm. When matchmaking between a client and a server, we:
Select the top 6 servers based on GeoIP.
Assign a weight of 1.0 to each server.
Modify the weight based on GeoIP distance. If the server and client are at the same location, the weight remains 1.0; otherwise, it is decreased by half for every 200 miles of distance (meaning a server 800 miles away has a weight of 0.0625).
Multiply the weight by 2.0 if the server is known to have the object. Multiply by 0.5 if the server is known to not have the object. Do nothing if we don't know whether the object is cached.
Multiply the weight by a load multiplier. If the load is unknown, it is considered to be 1.0. If the load is between 0.0 and 10.0, the load multiplier is 1.0. Above 10.0, the load multiplier decreases by half for every 4 units of load. So, a load of 18.0 multiplies the weight by 0.25, while a load of 8.0 multiplies it by 1.0.
Create ranges [0, weight_1), [weight_1, weight_1 + weight_2), ... for servers 1 through 6. Select a random number in the range [0, sum(weights)). The server whose range contains the number is sorted to the top.
Repeat step (6), excluding already-selected servers, to select the second- and third-place servers.
Respond to the client with the top 3 servers.
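The matchmaking steps above can be sketched roughly as follows. This is a hypothetical illustration, not an existing implementation; the function names, the `(server_id, weight)` representation, and the `has_object`/`load` inputs are assumptions made for the sketch:

```python
import random

# Constants from the proposal.
DISTANCE_HALVING_MILES = 200.0
OBJECT_PRESENT_MULT = 2.0
OBJECT_ABSENT_MULT = 0.5
LOAD_THRESHOLD = 10.0
LOAD_HALVING_UNITS = 4.0


def server_weight(distance_miles, has_object, load):
    """Compute the weight for one candidate server.

    has_object: True, False, or None (unknown).
    load: a float, or None if unknown (treated as 1.0).
    """
    # GeoIP component: halve the weight for every 200 miles of distance.
    weight = 0.5 ** (distance_miles / DISTANCE_HALVING_MILES)

    # Object-locality component: 2x if present, 0.5x if absent,
    # unchanged if unknown.
    if has_object is True:
        weight *= OBJECT_PRESENT_MULT
    elif has_object is False:
        weight *= OBJECT_ABSENT_MULT

    # Load component: no penalty up to the threshold, then halve
    # for every 4 units of load above it.
    if load is None:
        load = 1.0
    if load > LOAD_THRESHOLD:
        weight *= 0.5 ** ((load - LOAD_THRESHOLD) / LOAD_HALVING_UNITS)
    return weight


def select_top3(candidates):
    """Weighted sampling without replacement: pick 3 of the candidates.

    candidates: list of (server_id, weight) pairs.
    """
    pool = list(candidates)
    chosen = []
    for _ in range(3):
        total = sum(w for _, w in pool)
        r = random.uniform(0.0, total)
        cumulative = 0.0
        for i, (sid, w) in enumerate(pool):
            cumulative += w
            if r < cumulative:
                chosen.append(sid)
                pool.pop(i)
                break
        else:
            # Float edge case: r landed exactly on total; take the last one.
            chosen.append(pool.pop()[0])
    return chosen
```

Removing each winner from the pool before re-drawing is what makes the repeated step (6) yield three distinct servers.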
Constants in this proposal are:
Number of servers under consideration: 6
Halving distance for the GeoIP weight: 200 miles
Multiplier for knowing whether an object is present: 2
Threshold where the load modifier kicks in: 10.0
Halving interval for load: 4.0
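With these constants, the combined weight of one candidate reduces to a single expression. A minimal sketch, assuming distance is in miles, an unknown load counts as 1.0, and `object_multiplier` collapses the three locality cases into one argument:

```python
def combined_weight(distance_miles, object_multiplier, load):
    """object_multiplier is 2.0 (object present), 0.5 (absent), or 1.0 (unknown)."""
    geo = 0.5 ** (distance_miles / 200.0)                 # halve per 200 miles
    load_mult = 0.5 ** (max(0.0, load - 10.0) / 4.0)      # halve per 4 units above 10
    return geo * object_multiplier * load_mult

# A server 400 miles away that has the object, at load 14.0:
# 0.25 (geo) * 2.0 (object) * 0.5 (load) = 0.25
print(combined_weight(400, 2.0, 14.0))
```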
The reasons I like this approach are:
Takes into account GeoIP, load, and object locality.
Stochastic -- we randomly distribute clients instead of putting the load deterministically on one server.
In the absence of load and object-locality data, it behaves similarly to today's algorithm.