Commit a1afb4c

Completed cache doc, some todos to fill in based on local integration testing
1 parent fdf52c0 commit a1afb4c

1 file changed: +57 −8 lines changed

docs/user-guides/advanced/model-memory-cache.md

Lines changed: 57 additions & 8 deletions
@@ -2,12 +2,17 @@

# In-Memory Model Cache

-Guardrails supports an in-memory cache to store user-prompts and the LLM response to them.
-This can be applied to any model, using the `Model.cache` field
+Guardrails supports an in-memory cache that avoids repeated LLM calls for identical prompts. It stores user prompts and the corresponding LLM responses. Before making an LLM call, Guardrails first checks whether the prompt matches one already in the cache. If it does, the stored response is returned from the cache instead of prompting the LLM again, which improves latency.
+In-memory caches are supported for the main LLM and all Nemoguard models ([Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect)). Each model's cache is configured independently.
+The cache uses exact matching (after removing whitespace) on LLM prompts, with a Least-Frequently-Used (LFU) eviction algorithm.
+For observability, cache hits and misses are visible in OTEL telemetry and logged on a configurable cadence.
+To get started, an example configuration is shown below. The rest of this page takes a deeper look at how the cache works, the available telemetry, and considerations when enabling caching in a horizontally scalable service.

## Example Configuration

-Let's walk through an example of adding caching to a Content-Safety Guardrails application. The initial `config.yml` is shown below.
+Let's walk through an example of adding caching to a Content-Safety Guardrails application. The initial `config.yml`, without caching, is shown below.
+We use a [Llama 3.3 70B-Instruct](https://build.nvidia.com/meta/llama-3_3-70b-instruct) main LLM to generate responses, and check both the user input and the LLM response with the [Llama 3.1 Nemoguard 8B Content Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety) model.
+The input rail checks the safety of the user prompt before it is sent to the main LLM. The output rail checks the user input together with the main LLM response to make sure the response is safe.

```yaml
# Content-Safety config.yml (without caching)
@@ -29,9 +34,9 @@ rails:
      - content safety check output $model=content_safety
```
-The yaml file below shows the same configuration, but this time with caching enabled on the main LLM and Content-Safety Nemoguard model.
-The `cache` section controls the caching. The `meta/llama-3.3-70b-instruct` model has a cache with a maximum size of 1,000 entries, while the `nvidia/llama-3.1-nemoguard-8b-content-safety` has a cache maximum size of 10,000 entries.
-Both caches have telemetry reporting enabled.
+The yaml file below shows the same configuration, with caching enabled on the main LLM and the Content-Safety Nemoguard model.
+The main LLM and Nemoguard Content-Safety caches have maximum sizes of 1,000 and 10,000 entries respectively.
+Both caches are configured to log cache statistics. The main LLM cache statistics are logged every 60 seconds (1 minute), while the Content-Safety cache statistics are logged every 360 seconds (6 minutes).

```yaml
# Content-Safety config.yml (with caching)
@@ -44,6 +49,7 @@ models:
      maxsize: 1000
      stats:
        enabled: true
+        log_interval: 60

  - type: content_safety
    engine: nim
@@ -53,6 +59,8 @@ models:
      maxsize: 10000
      stats:
        enabled: true
+        log_interval: 360

rails:
  input:
    flows:
@@ -62,10 +70,51 @@ rails:
      - content safety check output $model=content_safety
```
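
For reference, the hunks above show only the changed lines. The sketch below assembles what the full `models` and `rails` sections of the cached configuration might look like; the `model` names and the `content safety check input` flow are inferred from the surrounding prose rather than visible in this diff.

```yaml
# Sketch only: assembled from the diff hunks above; lines marked "assumed" are not shown in the diff.
models:
  - type: main
    engine: nim
    model: meta/llama-3.3-70b-instruct  # assumed from the prose
    cache:
      maxsize: 1000
      stats:
        enabled: true
        log_interval: 60

  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety  # assumed from the prose
    cache:
      maxsize: 10000
      stats:
        enabled: true
        log_interval: 360

rails:
  input:
    flows:
      - content safety check input $model=content_safety  # assumed input flow
  output:
    flows:
      - content safety check output $model=content_safety
```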

+## How does the Cache work?

+When the cache is enabled, before each LLM call Guardrails first checks whether the same prompt was previously sent to the same LLM. This uses an exact-match lookup, after removing whitespace.
+If there is a cache hit (the same prompt was sent to the same LLM earlier and its response was stored in the cache), the stored response is returned without calling the LLM.
+If there is a cache miss (no stored response for this prompt), the LLM is called as usual and the response is then stored in the cache.

+For security reasons, user prompts are not stored directly. After removing whitespace, the user prompt is hashed using SHA-256 and the hash is used as the cache key.
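
Since whitespace is removed before hashing, a minimal sketch of how such a cache key could be derived is shown below; this is illustrative only, and the exact normalization in the implementation may differ.

```python
# Illustrative sketch only -- not the Guardrails implementation.
# Assumes all whitespace is stripped before hashing; the real normalization may differ.
import hashlib

def cache_key(prompt: str) -> str:
    normalized = "".join(prompt.split())  # remove all whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Two prompts that differ only in whitespace map to the same key:
assert cache_key("Is this safe?") == cache_key("Is  this \n safe?")
```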

-## Least Frequently Used Cache
+If a new cache record needs to be added and the cache already holds `maxsize` entries, the Least-Frequently-Used (LFU) algorithm decides which cache record to evict.
+The LFU algorithm keeps the most frequently accessed entries in the cache, improving the probability of a cache hit.
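
To illustrate the eviction policy (again, a sketch rather than the Guardrails implementation), a minimal LFU cache could look like this:

```python
# Illustrative LFU sketch only -- not the Guardrails implementation.
from collections import defaultdict

class LFUCache:
    def __init__(self, maxsize: int):
        self.maxsize = maxsize
        self.store = {}               # key -> cached response
        self.freq = defaultdict(int)  # key -> access count

    def get(self, key):
        if key not in self.store:
            return None               # cache miss
        self.freq[key] += 1           # record the access for LFU bookkeeping
        return self.store[key]        # cache hit

    def put(self, key, value):
        if key not in self.store and len(self.store) >= self.maxsize:
            # Evict the least-frequently-used entry to make room.
            victim = min(self.store, key=self.freq.__getitem__)
            del self.store[victim]
            del self.freq[victim]
        self.store[key] = value
        self.freq[key] += 1
```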

+## Telemetry and logging

+Guardrails supports OTEL telemetry to trace client requests through Guardrails and any calls to LLMs or APIs. Cache behaviour is reflected in these traces: cache hits have a far shorter duration and no LLM call, while cache misses include an LLM call. OTEL telemetry is a good fit for operational dashboards.
+Cache statistics are also logged on a configurable cadence if `cache.stats.enabled` is set to `true`. Every `log_interval` seconds, the cache statistics are logged in the format below.
+The most important metric is the "Hit Rate", the proportion of LLM calls returned from the cache. If this value remains low, exact-match caching may not be a good fit for your use case.
+**TODO! Do these reset on every measurement period, or increment forever (rollover concerns?)**

+```
+# TODO! Replace with measured values
+"LFU Cache Statistics - "
+"Size: {stats['current_size']}/{stats['maxsize']} | "
+"Hits: {stats['hits']} | "
+"Misses: {stats['misses']} | "
+"Hit Rate: {stats['hit_rate']:.2%} | "
+"Evictions: {stats['evictions']} | "
+"Puts: {stats['puts']} | "
+"Updates: {stats['updates']}"
+```

+These metrics are detailed below:

+* Size: The number of LLM responses currently stored in the cache, reported against `maxsize`.
+* Hits: The number of cache hits.
+* Misses: The number of cache misses.
+* Hit Rate: The proportion of calls returned from the cache, a float between 0.0 (all calls sent to the LLM) and 1.0 (all calls returned from the cache). See the sketch after this list.
+* Evictions: The number of cache evictions.
+* Puts: The number of new cache records stored.
+* Updates: The number of existing cache records updated.
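
Assuming the hit rate is derived from the hit and miss counters in the usual way (an assumption, since the calculation is not shown in this diff), the relationship is:

```python
# Illustrative only: the usual relationship between hits, misses, and hit rate.
hits, misses = 750, 250
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
print(f"Hit Rate: {hit_rate:.2%}")  # prints "Hit Rate: 75.00%"
```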

-## Telemetry

## Horizontal scaling and caching

+The cache is implemented in-memory on each Guardrails node. When Guardrails runs as a horizontally scaled backend service, many nodes sit behind an API gateway and load balancer that distribute traffic to meet availability and performance targets.
+The current implementation keeps a separate cache on each node, with no sharing of cache entries between nodes.
+Because the load balancer spreads traffic across all Guardrails nodes, a repeated prompt is only served from cache if its response is already stored on a node and the load balancer routes the request to that same node.
+In practice, frequently requested user prompts will likely end up cached on many Guardrails nodes as the load balancer spreads them around, so the performance impact may be less significant.
