Idea for Discussion: Device Memory Management Primitives #780

Open
wacky6 opened this issue Nov 7, 2024 · 5 comments

wacky6 commented Nov 7, 2024

This idea is perhaps forward-looking, but I'd like to bring it up for discussion.

Motivations

  • Reduce the GPU/NPU memory required to complete a use case (e.g. text2image).
  • Reduce the memory-copy overhead of loading model weights (e.g. from disk) into GPU/NPU memory.
  • Enable inference with models whose weights can't fit in main memory as a whole (e.g. via pipelining).

Real World Examples

Case 1: Text to Image

Text-to-image use cases generally involve multiple models, represented as a three-stage pipeline (consisting of at least three models).

Take FLUX generating a 1024x1024 image as an example:

| Stage | Model(s) | Example Model Weight Size | Model's Resident Memory on GPU (approx.) |
|---|---|---|---|
| 1. Text to Embedding | Google T5 XXL and CLIP-L; fp8 | 7 GB | 8-12 GB |
| 2. Diffusion / Denoising | FLUX UNet; fp8 | 12 GB | 16 GB |
| 3. Image decoding | FLUX Variational AutoEncoder; fp8 | 200 MB | 16 GB |

In the ideal case, all three stages fit in GPU memory simultaneously (totaling > 32GB). This exceeds the capacity of every consumer GPU (except Apple M series chips with a large unified memory).

The only practical way to run FLUX is to "load, compute, unload" each model on the GPU in sequence, at the cost of reinitializing each stage for every text2image inference.

This reduces the required GPU memory from sum(required_memory_per_stage) to max(required_memory_per_stage), and requires main memory to hold sum(size_of_model_weights).
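
For concreteness, the pipelined flow looks roughly like the sketch below. loadStage / runStage / unloadStage are hypothetical placeholders for whatever the framework actually does; nothing here is an existing API.

// Hypothetical pipelining sketch for case 1 (TypeScript).
declare function loadStage(name: string): Promise<object>;             // weights: disk -> main memory -> GPU
declare function runStage(stage: object, input: unknown): Promise<unknown>;
declare function unloadStage(stage: object): Promise<void>;            // free this stage's GPU memory

const STAGES = ["text_encoder", "denoiser", "vae_decoder"];

async function text2image(prompt: string): Promise<unknown> {
  let data: unknown = prompt;
  for (const name of STAGES) {
    const stage = await loadStage(name);   // paid again for every generated image
    data = await runStage(stage, data);
    await unloadStage(stage);              // only one stage is resident at a time
  }
  // Peak GPU memory is max(per-stage memory), roughly 16GB, instead of the > 32GB sum.
  return data;
}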

Note:

  • Stable Diffusion has the same architecture (three stages), and uses pipelining on consumer GPUs.
  • Using a larger image size increases the resident memory for stages 2 and 3 proportionally.

Case 2: Mixture of Experts (MoE) model

MoE models use multiple small expert models to produce one inference result (e.g. the next token in an LLM).

Only one small model (say, 1/8 the size of the entire model) needs to reside in GPU memory at a time. Each small model computes its result in sequence, then the results are merged into a single output token.

High-level pseudocode:

output_token = None
while output_token != END_OF_TEXT:
  outputs = []
  for small_model in small_models:
    small_model.load_to_gpu()       # copy this expert's weights into GPU memory
    outputs.append(small_model.predict_next_token())
    small_model.unload_from_gpu()   # free GPU memory for the next expert
  output_token = select_from_outputs(outputs)
  emit_to_caller(output_token)

If the GPU has enough memory, all of the small models can reside in GPU memory (load and unload become no-ops). If not, the small models are repeatedly loaded to and unloaded from the GPU (usually to/from main memory).

Some LLMs adopt an architecture where the number of activated parameters (the model weights that have to reside in GPU memory) is much smaller than the total number of parameters (the total size of the model weights). I believe they function similarly to MoE at inference time from a memory-usage standpoint.

Examples:

  • Mixtral 8x7b
  • DeepSeek v2 (236B total params, 21B activated params per token)

Case 3: Model Streaming

I observed this technique when playing with Ollama.

If the model weights are too big to fit into main memory, they are streamed from disk during inference (e.g. the model will be read from disk N times to predict N tokens).

It's slow (bottlenecked by disk throughput), but it does allow inference with large models to happen.
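
A minimal sketch of that loop, assuming the weight file's per-layer offsets are known. runLayer and the offset bookkeeping are made up for illustration; File.slice() and Blob.arrayBuffer() are real web APIs.

// Sketch of layer-by-layer weight streaming from disk (case 3, TypeScript).
interface LayerLocation { byteOffset: number; byteLength: number; }
declare function runLayer(weights: Float32Array, activations: Float32Array): Float32Array;  // hypothetical compute step

async function predictOneToken(file: File, layers: LayerLocation[],
                               activations: Float32Array): Promise<Float32Array> {
  for (const layer of layers) {
    // Read only this layer's weights from disk; the whole model never sits in main memory.
    const bytes = await file.slice(layer.byteOffset, layer.byteOffset + layer.byteLength).arrayBuffer();
    activations = runLayer(new Float32Array(bytes), activations);
    // `bytes` becomes collectable here; the re-read per token is why disk throughput dominates.
  }
  return activations;
}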

Current WebNN API

Cases 1 and 2 are feasible but not efficient: model pipelining and partitioning involve destroying and rebuilding the compute graph.

For every inference (e.g. generating one image from a text prompt, or predicting one token with an MoE model), we need to:

  • Copy model weights from JavaScript ArrayBuffers into the WebNN service process
  • Call a platform-specific API to build the graph (with an optional, potentially expensive, fuse/optimize pass)

Case 3 is infeasible because the entire model weights need to be copied to the WebNN service process before we can build a graph. We can fall back to model partitioning and convert the problem to case 2.
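
For illustration, here is roughly what the "load, compute, unload" flow of case 1 costs with today's API. buildStageGraph / runGraph are placeholders standing in for the real MLGraphBuilder and compute/dispatch calls, which I've elided.

// Sketch: pipelining with today's API means rebuilding each stage's graph per inference (TypeScript).
declare function buildStageGraph(context: unknown, stage: string, weights: ArrayBuffer): Promise<unknown>;  // wraps MLGraphBuilder + build()
declare function runGraph(context: unknown, graph: unknown, input: unknown): Promise<unknown>;              // wraps compute/dispatch

async function text2imageToday(prompt: string, weightFiles: Record<string, Blob>) {
  const context = await (navigator as any).ml.createContext({ deviceType: "gpu" });  // WebNN types aren't in lib.dom yet
  let data: unknown = prompt;
  for (const stageName of ["text_encoder", "denoiser", "vae_decoder"]) {
    // 1. Copy this stage's weights into a JS ArrayBuffer (and from there into the WebNN service process).
    const weights = await weightFiles[stageName].arrayBuffer();
    // 2. Pay graph construction plus the platform build/optimize pass again.
    const graph = await buildStageGraph(context, stageName, weights);
    data = await runGraph(context, graph, data);
    // 3. Drop the graph so the next stage fits in GPU memory; all of this repeats for every image.
  }
  return data;
}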

API / Spec Implication

The topic for discussion :)

Two primitives?

  1. Swap a built MLGraph between GPU/NPU memory and main memory (e.g. like PyTorch's model.to(device)); a rough sketch of what this could look like follows below.
  2. Memory mapping from a file on disk to main memory (a POSIX mmap equivalent?)
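
Purely to make the shape of primitive 1 concrete, something like the hypothetical interface below; none of these names exist in WebNN today.

// Hypothetical shape for primitive 1 (TypeScript): keep the built graph alive, but
// move its weights between device memory and main memory on demand. Not proposed spec text.
interface MLGraphWithResidency /* would extend MLGraph */ {
  moveTo(device: "gpu" | "cpu"): Promise<void>;   // cf. PyTorch's model.to(device)
}

async function runPipeline(stages: MLGraphWithResidency[],
                           run: (graph: MLGraphWithResidency) => Promise<void>) {
  for (const graph of stages) {
    await graph.moveTo("gpu");   // swap weights into device memory
    await run(graph);
    await graph.moveTo("cpu");   // evict to main memory without destroying the built graph
  }
}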

Related specs / APIs I can think of:

  • MLBuffer / MLTensor
  • Fetch: Obtain a mmap-ed response?
  • File System Access: mmap a file on disk (from FileSystemFileHandle)?
  • WebGPU: Shader's compilation cache
  • @reillyeon mentioned to me the idea of introducing a caching mechanism for MLGraph (e.g. save it for later use and avoid repeated graph compilation). Such a mechanism might help here.
reillyeon (Contributor) commented:

How much of this is an implementation issue (i.e. implementations can be cleverer than they currently are about memory management when multiple large graphs are active) vs. something that needs to be exposed to developers via the API?


wacky6 commented Nov 8, 2024

I think model swapping between GPU and main memory is feasible in a clever implementation (an LRU cache of some sort). I'm not sure how much overhead that would add, or whether it would make it harder for web applications to get predictable performance characteristics (what if the LRU cache isn't clever enough to adapt to the workload?).
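
To illustrate the kind of policy I have in mind, a toy sketch of an implementation-side cache; this is not how any browser actually behaves, and the upload/evict hooks are placeholders.

// Sketch of an implementation-side LRU residency policy for built graphs (TypeScript).
// "Resident" = weights live in GPU/NPU memory; evicted graphs fall back to main memory.
class GraphResidencyLru<Graph> {
  private resident: Graph[] = [];   // ordered least recently used -> most recently used
  constructor(private maxResident: number,
              private upload: (g: Graph) => Promise<void>,    // main memory -> device
              private evict: (g: Graph) => Promise<void>) {}  // device -> main memory

  async ensureResident(graph: Graph): Promise<void> {
    const i = this.resident.indexOf(graph);
    if (i !== -1) {
      this.resident.splice(i, 1);                // already resident: just refresh recency
    } else {
      if (this.resident.length >= this.maxResident) {
        const victim = this.resident.shift()!;   // least recently used graph
        await this.evict(victim);
      }
      await this.upload(graph);
    }
    this.resident.push(graph);                   // most recently used goes to the back
  }
}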

I think the "stream from disk" / mmap approach most likely requires some API change (I don't think ArrayBuffers are mmap-ed).

reillyeon (Contributor) commented:

> I think the "stream from disk" / mmap approach most likely requires some API change (I don't think ArrayBuffers are mmap-ed).

Once a graph is built it isn't backed by ArrayBuffers; it is opaquely held behind the MLGraph interface, and an implementation could keep the weights on disk and stream / mmap them in as necessary.

Similarly, we've been working on changes to the implementation of MLGraphBuilder.constant() so that the constant value doesn't have to stay in memory after it has been wrapped in an (again opaque) MLOperator instance. This is transparent to developers.


wacky6 commented Nov 11, 2024

So reading into an ArrayBuffer is still required?

Some model files can be >40GB in size, which won't fit in main memory, so the graph-building stage will fail (because building requires an ArrayBuffer in memory).

reillyeon (Contributor) commented:

With the changes we've made, the ArrayBuffers passed to constant() do not need to be held in memory until build() is called; they could be written to disk incrementally. They only need to exist in memory for the duration of the constant() call. While a model file can be >4GB in size, an individual constant is typically much smaller because it only contains the weights for a single layer of the model.
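
If I understand correctly, an app could then stream constants in per layer, roughly like the sketch below. The descriptor details, layer bookkeeping, and wireUpGraph topology are simplified placeholders.

// Sketch (TypeScript): feed constants to the builder one layer at a time, so only one
// layer's ArrayBuffer is ever alive in JS even if the model file is tens of GB.
interface LayerDesc { byteOffset: number; byteLength: number; shape: number[]; }
declare function wireUpGraph(builder: unknown, constants: unknown[]): Record<string, unknown>;  // placeholder topology

async function buildFromFile(file: File, layers: LayerDesc[]) {
  const context = await (navigator as any).ml.createContext({ deviceType: "gpu" });  // WebNN types aren't in lib.dom yet
  const builder = new (window as any).MLGraphBuilder(context);

  const constants: unknown[] = [];
  for (const layer of layers) {
    const bytes = await file.slice(layer.byteOffset, layer.byteOffset + layer.byteLength).arrayBuffer();
    // The buffer only needs to live for this call; afterwards the implementation can spill
    // the constant to disk and the ArrayBuffer can be garbage collected.
    constants.push(builder.constant({ dataType: "float32", dimensions: layer.shape },
                                    new Float32Array(bytes)));
  }
  return await builder.build(wireUpGraph(builder, constants));
}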
