(model-memory-cache)=

# Model Memory Cache

Guardrails supports an in-memory cache that avoids repeated LLM calls for identical prompts. The cache stores user prompts and their corresponding LLM responses. Before making an LLM call, Guardrails checks whether the prompt already exists in the cache. If it does, the stored response is returned instead of calling the LLM, reducing latency.

In-memory caches are supported for all NemoGuard models: [Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect). Each model can be configured independently.

The cache uses exact matching (after removing whitespace) on LLM prompts and a Least-Frequently-Used (LFU) algorithm for cache evictions.

For observability, cache hits and misses are visible in OpenTelemetry (OTEL) telemetry, and cache statistics are written to logs at a configurable interval.

To get started with caching, refer to the example configurations below. The rest of this page provides a deep dive into how the cache works, the available telemetry, and considerations for enabling caching in a horizontally scaled service.

---

## Example Configuration

The following example configurations show how to add caching to a Content-Safety Guardrails application.
The examples use [Llama 3.3 70B Instruct](https://build.nvidia.com/meta/llama-3_3-70b-instruct) as the main LLM that generates responses. The input rails check the user prompt with the [Content-Safety](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety), [Topic-Control](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-topic-control), and [Jailbreak Detection](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect) models before sending it to the main LLM for a response. The output rail then checks both the user input and the main LLM response with the Content-Safety model to ensure the response is safe.

### Without Caching

The following `config.yml` file shows the initial configuration without caching.

```yaml
models:
  - type: main
    engine: nim
    model: meta/llama-3.3-70b-instruct

  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety

  - type: topic_control
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-topic-control

  - type: jailbreak_detection
    engine: nim
    model: jailbreak_detect

rails:
  input:
    flows:
      - jailbreak detection model
      - content safety check input $model=content_safety
      - topic safety check input $model=topic_control

  output:
    flows:
      - content safety check output $model=content_safety

  config:
    jailbreak_detection:
      nim_base_url: "https://ai.api.nvidia.com"
      nim_server_endpoint: "/v1/security/nvidia/nemoguard-jailbreak-detect"
      api_key_env_var: NVIDIA_API_KEY
```

### With Caching

The following configuration file shows the same configuration with caching enabled for the Content-Safety, Topic-Control, and Jailbreak Detection NemoGuard NIM microservices.
All three caches hold up to 10,000 records and log their statistics every 60 seconds.

```yaml
models:
  - type: main
    engine: nim
    model: meta/llama-3.3-70b-instruct

  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
    cache:
      enabled: true
      maxsize: 10000
      stats:
        enabled: true
        log_interval: 60

  - type: topic_control
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-topic-control
    cache:
      enabled: true
      maxsize: 10000
      stats:
        enabled: true
        log_interval: 60

  - type: jailbreak_detection
    engine: nim
    model: jailbreak_detect
    cache:
      enabled: true
      maxsize: 10000
      stats:
        enabled: true
        log_interval: 60

rails:
  input:
    flows:
      - jailbreak detection model
      - content safety check input $model=content_safety
      - topic safety check input $model=topic_control

  output:
    flows:
      - content safety check output $model=content_safety

  config:
    jailbreak_detection:
      nim_base_url: "https://ai.api.nvidia.com"
      nim_server_endpoint: "/v1/security/nvidia/nemoguard-jailbreak-detect"
      api_key_env_var: NVIDIA_API_KEY
```

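As a quick check that caching is active, you can load this configuration and send the same prompt twice. The following snippet is a minimal sketch, assuming the configuration above is saved as `config.yml` in a local `./config` directory and that the `NVIDIA_API_KEY` environment variable is set; the prompt text is only an example.

```python
# Minimal sketch: issue the same prompt twice so the second request can be
# served from the in-memory caches of the NemoGuard rails.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")  # directory containing config.yml
rails = LLMRails(config)

messages = [{"role": "user", "content": "How do I reset my password?"}]

first = rails.generate(messages=messages)   # cache misses: each guard model is called
second = rails.generate(messages=messages)  # the input rails should now hit the cache

print(first["content"])
print(second["content"])
```

With `stats.enabled: true`, the resulting hits also appear in the LFU cache statistics described later on this page.
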
---

## How the Cache Works

When the cache is enabled, Guardrails checks whether a prompt was already sent to the LLM before making each call. This check is an exact-match lookup performed after removing whitespace.

If there is a cache hit (the same prompt was sent to the same LLM earlier and the response was stored in the cache), the stored response is returned without calling the LLM.

If there is a cache miss (there is no stored LLM response for this prompt), the LLM is called as usual and the response is stored in the cache when it is received.

For security reasons, user prompts are not stored directly. After removing whitespace, the user prompt is hashed with SHA-256 and the hash is used as the cache key.

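The following sketch illustrates this flow. It is not the Guardrails implementation: the exact whitespace normalization is an assumption, and `call_llm` stands in for the real model call.

```python
import hashlib

def make_cache_key(prompt: str) -> str:
    """Hash the whitespace-stripped prompt so the raw text is never stored."""
    normalized = "".join(prompt.split())  # assumption: all whitespace is removed
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def check_prompt(prompt: str, cache: dict, call_llm) -> str:
    key = make_cache_key(prompt)
    if key in cache:                 # cache hit: return the stored response, no LLM call
        return cache[key]
    response = call_llm(prompt)      # cache miss: call the LLM as usual
    cache[key] = response            # store the response for later requests
    return response
```
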
If a new cache record needs to be added and the cache already has `maxsize` entries, the Least-Frequently-Used (LFU) algorithm is used to decide which cache record to evict.
The LFU algorithm ensures that the most frequently accessed cache entries remain in the cache, improving the probability of a cache hit.

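The eviction policy can be sketched as a structure that tracks an access count per entry and evicts the entry with the smallest count when the cache is full. This toy class only illustrates the rule; it is not the data structure Guardrails uses.

```python
class LFUCacheSketch:
    """Toy LFU cache: evicts the entry with the lowest access count when full."""

    def __init__(self, maxsize: int):
        self.maxsize = maxsize
        self.values: dict[str, str] = {}
        self.counts: dict[str, int] = {}

    def get(self, key: str):
        if key not in self.values:
            return None              # cache miss
        self.counts[key] += 1        # each hit marks the entry as more frequently used
        return self.values[key]

    def put(self, key: str, value: str) -> None:
        if key not in self.values and len(self.values) >= self.maxsize:
            # Cache is full: evict the least-frequently-used entry.
            victim = min(self.counts, key=self.counts.get)
            del self.values[victim]
            del self.counts[victim]
        self.values[key] = value
        self.counts[key] = self.counts.get(key, 0) + 1
```
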
---

## Telemetry and Logging

Guardrails supports OTEL telemetry to trace client requests through Guardrails and any calls to LLMs or APIs. Cache operation is reflected in these traces:

- **Cache hits** have a far shorter duration and no LLM call.
- **Cache misses** include an LLM call.

This OTEL telemetry is well suited to operational dashboards.

Cache statistics are also logged at a configurable cadence if `cache.stats.enabled` is set to `true`. Every `log_interval` seconds, the cache statistics are logged in the format shown below.

The most important metric is the *Hit Rate*, which represents the proportion of LLM calls served from the cache. If this value remains low, the exact-match approach might not be a good fit for your use case.

These statistics accumulate for as long as Guardrails is running.

```text
LFU Cache Statistics - Size: 23/10000 | Hits: 20 | Misses: 3 | Hit Rate: 87% | Evictions: 0 | Puts: 21 | Updates: 4
```

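Using the statistics above, the hit rate works out to Hits / (Hits + Misses) = 20 / (20 + 3) ≈ 0.87, which is reported as 87%.
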
| 164 | + |
| 165 | +The following list describes the metrics included in the cache statistics: |
| 166 | + |
| 167 | +- **Size**: The number of LLM calls stored in the cache. |
| 168 | +- **Hits**: The number of cache hits. |
| 169 | +- **Misses**: The number of cache misses. |
| 170 | +- **Hit Rate**: The proportion of calls returned from the cache. This is a float between 1.0 (all calls returned from the cache) and 0.0 (all calls sent to the LLM). |
| 171 | +- **Evictions**: The number of cache evictions. |
| 172 | +- **Puts**: The number of new cache records stored. |
| 173 | +- **Updates**: The number of existing cache records updated. |
| 174 | + |
---

## Horizontal Scaling and Caching

The cache is implemented in memory on each Guardrails node. When Guardrails operates as a horizontally scaled backend service, multiple nodes run behind an API gateway and load balancer to distribute traffic and meet availability and performance targets.

The current implementation maintains a separate cache on each node and does not share cache entries between nodes. For a cache hit to occur, the following conditions must be met:

1. The request must have been previously sent and stored in a node's cache.
2. The load balancer must direct the subsequent request to that same node.

In practice, the load balancer spreads traffic across all Guardrails nodes, so frequently requested user prompts are distributed across multiple caches. This reduces cache hit rates in horizontally scaled deployments compared to single-node deployments, as the sketch below illustrates.

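The effect can be illustrated with a small simulation. This is only a thought experiment under simplified assumptions (a single repeated prompt and strict round-robin routing), not measured Guardrails behavior.

```python
def repeated_prompt_hit_rate(nodes: int, requests: int) -> float:
    """Hit rate when one prompt is sent `requests` times and a round-robin
    load balancer spreads the requests over `nodes` independent caches."""
    caches = [set() for _ in range(nodes)]  # one in-memory cache per node
    hits = 0
    for i in range(requests):
        cache = caches[i % nodes]           # round-robin routing
        if "prompt" in cache:
            hits += 1                       # this node has served the prompt before
        else:
            cache.add("prompt")             # miss: this node caches it now
    return hits / requests

print(repeated_prompt_hit_rate(nodes=1, requests=6))  # ~0.83: only the first request misses
print(repeated_prompt_hit_rate(nodes=3, requests=6))  # 0.5: each node must miss once
```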