
add VLM support, refactor common LM code into MLXLMCommon. breaking API changes #151

Merged: 14 commits merged into main from the vlm1 branch on Dec 10, 2024

Conversation

davidkoski (Collaborator) commented Nov 1, 2024:

Status: almost ready, just testing and cleaning up. Models are working. I am using a local override of mlx-swift main.

Xcode 16

Xcode 16 is required to build the example applications and tools. Older Xcode versions can still build the libraries via swiftpm, so requirements are unchanged for any applications or libraries that depend on them.

This change is required because the xcodeproj now refers to the local Package.swift file so that builds are consistent with what external users get. If needed we can switch back to using the xcodeproj for internal library builds and swiftpm for external library builds -- if this causes a problem please file an issue and it can be considered.

Additions

There are two new libraries:

- `MLXVLM` contains vision language models that combine images and text prompts to produce text results, e.g. `describe this image`. The implementations are based on models from https://github.com/Blaizzy/mlx-vlm.
- `MLXLMCommon` contains the `LanguageModel` code that is shared between `MLXLLM` and `MLXVLM`

The API between `LLM` and `VLM` is identical aside from the preparation of the `UserInput`.

```swift
let parameters = GenerateParameters()

// LLM prompt
let input = UserInput(prompt: "tell me a story")

// VLM prompt
let input = UserInput(prompt: "describe the image", images: [.url(url)])

// inference is identical
let result = try await modelContainer.perform { [generate, input] context in
    let input = try await context.processor.prepare(input: input)
    return try generate(input: input, parameters: parameters, context: context) { token in
        // print tokens as they are generated, stop early, etc.
        return .more
    }
}
```

VLM example code is available in the `llm-tool` example:

```
./mlx-run llm-tool vlm --help
OVERVIEW: evaluate prompt and images to generate text (VLM)

USAGE: llm-tool vlm <options>

OPTIONS:
  --model <model>         Name of the huggingface model or absolute path to directory
  -p, --prompt <prompt>   The message to be processed by the model.  Use @path,@path to load from files, e.g. @/tmp/prompt.txt
  --resize <resize>       Resize images to this size (width, height)
  --image <image>         Paths or urls for input images
...
```
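For example, a run against a local image might look like this (the image path is hypothetical, and `--model` is omitted here on the assumption that the tool falls back to its default VLM, as the dispatch code shown later in this PR suggests):

```
./mlx-run llm-tool vlm --prompt "describe the image" --image /path/to/photo.jpg
```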

Breaking Changes

These probably have no effect on code external to this repo:

- the mlx-swift-examples.xcodeproj now references the local `Package.swift` to build the libraries
- the example code now uses the naming matching external uses of mlx-swift-examples, e.g. `import LLM` -> `import MLXLLM`
- the library directories are now renamed to match their target names, e.g. `LLM` -> `MLXLLM`

Breaking:

- some code will now need to import both `MLXLLM` and `MLXLMCommon` (particularly code that loads models)
- `MLXLMCommon` contains the common API between LLM and VLM

```swift
import MLXLLM
import MLXLMCommon
```

- constants for models have moved from `ModelConfiguration` to `ModelRegistry`
- this is `MLXLLM.ModelRegistry` and there is also `MLXVLM.ModelRegistry`

```diff
-    let modelConfiguration = ModelConfiguration.phi3_5_4bit
+    let modelConfiguration = ModelRegistry.phi3_5_4bit
```

- the `loadModelContainer()` function is now `LLMModelFactory.shared.loadContainer()`
- there is a new `VLMModelFactory` with identical methods for loading VLMs (see the sketch after this list)

```diff
-     let modelContainer = try await LLM.loadModelContainer(configuration: modelConfiguration)
-    {
+     let modelContainer = try await LLMModelFactory.shared.loadContainer(
+          configuration: modelConfiguration
+    ) {
```

- `ModelContainer.perform` is now throwing (and in MLXLMCommon):

```diff
-     let result = await modelContainer.perform { model, tokenizer in
-          LLM.generate(
+     let result = try await modelContainer.perform { model, tokenizer in
+          try MLXLMCommon.generate(
```

- `ModelConfiguration` previously had a way to register new configurations. This is now on `LLMModelFactory` (and `VLMModelFactory` has the same):

```swift
LLMModelFactory.shared.modelRegistry.register(configurations: [modelConfiguration])
```
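As a sketch of the VLM side of the factory API: the configuration constant comes from `MLXVLM.ModelRegistry`, and the trailing closure parameter is assumed from the diff above, so treat `progress` as illustrative.

```swift
import MLXLMCommon
import MLXVLM

// Load a VLM with the new factory; the API mirrors LLMModelFactory.
let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: MLXVLM.ModelRegistry.paligemma3bMix448_8bit
) { progress in
    // download / load progress reporting (closure parameter assumed)
    print("loading: \(progress)")
}
```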

Deprecations

An example at the end shows all of these deprecations in context.

**Prefer to use the `ModelContext.processor` to prepare prompts.** Previously users would pass in a bare `[Int]` of tokens, but in order to support more complex inputs (VLMs) the use of bare `[Int]` is deprecated and callers should use `UserInput` and `LMInput`.

For example, previously callers might have done something like this:

```swift
let messages = [["role": "user", "content": prompt]]
let promptTokens = try await modelContainer.perform { _, tokenizer in
    try tokenizer.applyChatTemplate(messages: messages)
}
```

Now that should be:

```swift
let input = try await context.processor.prepare(input: .init(prompt: prompt))
```

Which will initialize a `UserInput` from the prompt text and produce an `LMInput` that can be used to generate tokens.

**This call to `generate()` is now deprecated:**

```swift
public func generate(
    promptTokens: [Int], parameters: GenerateParameters, model: any LanguageModel,
    tokenizer: Tokenizer,
    extraEOSTokens: Set<String>? = nil,
    didGenerate: ([Int]) -> GenerateDisposition
) throws -> GenerateResult
```

This consumed the `[Int]` variety of tokens. Now this is preferred:

```swift
public func generate(
    input: LMInput, parameters: GenerateParameters, context: ModelContext,
    didGenerate: ([Int]) -> GenerateDisposition
) throws -> GenerateResult
```

**This method on `ModelContainer` is now deprecated:**

```swift
    /// Perform an action on the model and/or tokenizer.  Callers _must_ eval any `MLXArray` before returning as
    /// `MLXArray` is not `Sendable`.
    @available(*, deprecated, message: "prefer perform(_:) that uses a ModelContext")
    public func perform<R>(_ action: @Sendable (any LanguageModel, Tokenizer) throws -> R) rethrows
        -> R
```

Use this one instead (though the former still works):

```swift
    /// Perform an action on the ``ModelContext``.  Callers _must_ eval any `MLXArray` before returning as
    /// `MLXArray` is not `Sendable`.
    public func perform<R>(_ action: @Sendable (ModelContext) async throws -> R) async rethrows -> R
```

Example

Putting all of these deprecations together, previously you might have generated text like this:

```swift
let messages = [["role": "user", "content": prompt]]
let promptTokens = try await modelContainer.perform { _, tokenizer in
    try tokenizer.applyChatTemplate(messages: messages)
}

let result = await modelContainer.perform { model, tokenizer in
    LLM.generate(
        promptTokens: promptTokens, parameters: generateParameters, model: model,
        tokenizer: tokenizer, extraEOSTokens: modelConfiguration.extraEOSTokens
    ) { tokens in ... }
}
```

now do this:

```swift
let result = try await modelContainer.perform { context in
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(
        input: input, parameters: generateParameters, context: context
    ) { tokens in ... }
}
```
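The VLM flow is identical apart from the `UserInput`. A sketch combining the pieces above (the image URL here is hypothetical):

```swift
let result = try await modelContainer.perform { context in
    let input = try await context.processor.prepare(
        input: .init(prompt: "describe the image", images: [.url(imageURL)]))
    return try MLXLMCommon.generate(
        input: input, parameters: generateParameters, context: context
    ) { tokens in
        return .more
    }
}
```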

@@ -1,30 +1,7 @@
// Copyright © 2024 Apple Inc.

import Foundation

public enum StringOrNumber: Codable, Equatable, Sendable {
davidkoski (Collaborator Author):

move to LMCommon


/// Container for models that guarantees single threaded access.
davidkoski (Collaborator Author):

Move to ModelContainer

}
}
}
// TODO move? these cause some ambiguity -- how to resolve?
davidkoski (Collaborator Author):

I was playing around with these to avoid breaking API -- moving types into LMCommon means callers will need to import LMCommon if they refer to them. This (the aliases) caused more trouble than I think it is worth

@@ -3,6 +3,7 @@
import Foundation
@preconcurrency import Hub
import MLX
import MLXLMCommon
import MLXNN
import MLXRandom
import Tokenizers
davidkoski (Collaborator Author):

Ultimately I would like this to move into LMCommon -- I think it can support both LLM and VLM models, but I didn't get a chance to move this yet.

import MLXNN
import MLXOptimizers
import MLXRandom
import Tokenizers

/// Layers to apply LoRA adapters to.
davidkoski (Collaborator Author):

Move to LMCommon

return y + scale * z
}
}

/// Equivalent to `lora.py/iterate_batches()`. Used internally by ``LoRATrain``.
struct LoRABatchIterator: Sequence, IteratorProtocol {
davidkoski (Collaborator Author):

Ideally the rest of this moves to LMCommon as well -- I think it can.

mutating func prompt(_ prompt: MLXArray)
func process(logits: MLXArray) -> MLXArray
mutating func didSample(token: MLXArray)
}
davidkoski (Collaborator Author):

The generate / step code has been refactored a bit and can now take custom logit samplers and processors
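A minimal sketch of what a custom processor could look like, assuming a protocol with the shape quoted above (the protocol name and the type here are hypothetical, and temperature scaling is just a simple illustration):

```swift
import MLX

// Scales logits before sampling. Conformance commented out because the
// protocol name is not shown in this diff excerpt.
struct TemperatureScaler /*: LogitProcessor */ {
    let temperature: Float

    mutating func prompt(_ prompt: MLXArray) {
        // no prompt-dependent state needed for this example
    }

    func process(logits: MLXArray) -> MLXArray {
        // divide by temperature; returns a new MLXArray
        logits / temperature
    }

    mutating func didSample(token: MLXArray) {
        // no per-token state to update in this example
    }
}
```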

public init(
prompt: MLXArray, model: any LanguageModel, cache: [KVCache]? = nil,
parameters: GenerateParameters
) throws {
davidkoski (Collaborator Author):

This now takes either a prompt (MLXArray) or an LMInput (text + image + ...) via multiple initializers.

}
}

public struct LMInput {
davidkoski (Collaborator Author):

A new union type that holds the different inputs to generate() and LanguageModel.prepare()

}
}

public struct LMOutput {
davidkoski (Collaborator Author):

Union type for the output. Some of the VLMs return additional state, which is represented here.

@@ -134,6 +135,7 @@ extension ModelConfiguration {
extraEOSTokens: ["<|end|>"]
)

// TODO the prompt formatter is replaced by the chat template
davidkoski (Collaborator Author):

Or is it? #150


import CoreImage
import Foundation
import MLX
davidkoski (Collaborator Author):

This file may be deleted -- it was some notes & thoughts along the way

// Copyright © 2024 Apple Inc.

import Foundation
import MLX
davidkoski (Collaborator Author):

Also to be deleted -- LMInput replaces this

private let context = CIContext()

// TODO documentation
public enum MediaProcessing {
davidkoski (Collaborator Author):

Needs documentation, but see PaliGemmaImageProvider which implements

```
SiglipImageProcessor {
  "do_convert_rgb": null,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "SiglipImageProcessor",
  "image_seq_length": 1024,
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "processor_class": "PaliGemmaProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 448,
    "width": 448
  }
}
```

from the python transformers code.
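A rough sketch of how those settings map onto the `MediaProcessing` calls, based on the PaliGemma preprocessing that appears later in this diff (the function name and the `CIImage` types here are assumptions for illustration):

```swift
import CoreImage

// Apply the SiglipImageProcessor settings shown above:
// resize to 448x448 with bicubic resampling, then normalize with mean/std 0.5.
func preprocess(_ input: CIImage) -> CIImage {
    var image = MediaProcessing.inSRGBToneCurveSpace(input)
    image = MediaProcessing.resampleBicubic(image, to: .init(width: 448, height: 448))
    image = MediaProcessing.normalize(image, mean: (0.5, 0.5, 0.5), std: (0.5, 0.5, 0.5))
    return image
}
```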

import MLXNN
import Tokenizers

// MARK: - Language
davidkoski (Collaborator Author):

Note: this builds, loads weights and "runs" but doesn't produce any output -- still needs to be debugged.

davidkoski (Collaborator Author):

it should be usable as an example of the structure I think we need

}
}

// TODO does not suport multiple images -- how do we represent?
davidkoski (Collaborator Author):

We need a protocol for the image and text processing pieces.

image = MediaProcessing.inSRGBToneCurveSpace(image)

image = MediaProcessing.resampleBicubic(image, to: .init(width: size, height: size))
image = MediaProcessing.normalize(image, mean: (0.5, 0.5, 0.5), std: (0.5, 0.5, 0.5))
davidkoski (Collaborator Author), Nov 1, 2024:

```
SiglipImageProcessor {
  "do_convert_rgb": null,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "SiglipImageProcessor",
  "image_seq_length": 1024,
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "processor_class": "PaliGemmaProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 448,
    "width": 448
  }
}
```

}
}

private func loadConfiguration(url: URL) throws -> PaliGemma {
davidkoski (Collaborator Author):

These next couple of functions are just stubs to let me try it out -- this will work much like the LLM models

private let _ropeTheta: Float?
public var ropeTheta: Float { _ropeTheta ?? 10_000 }
public let _ropeTraditional: Bool?
public var ropeTraditional: Bool { _ropeTraditional ?? false }
davidkoski (Collaborator Author), Nov 1, 2024:

Rather than doing the full implementation of Codable I went a simpler route for default values. Less code, cleaner (I think)
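A sketch of that pattern in a config struct; the key names and default values here are illustrative, not the actual PaliGemma configuration:

```swift
public struct ExampleTextConfiguration: Codable, Sendable {
    // Optional stored properties decode the key only if it is present...
    private let _ropeTheta: Float?
    private let _ropeTraditional: Bool?

    // ...and public accessors supply the default when the key was absent.
    public var ropeTheta: Float { _ropeTheta ?? 10_000 }
    public var ropeTraditional: Bool { _ropeTraditional ?? false }

    enum CodingKeys: String, CodingKey {
        case _ropeTheta = "rope_theta"
        case _ropeTraditional = "rope_traditional"
    }
}
```

This avoids writing a full `init(from:)` just to provide defaults for missing JSON keys.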

@Option var path: URL

@MainActor
mutating func run() async throws {
davidkoski (Collaborator Author):

Just stub code to exercise the model. This still needs the input processing layers, in particular the prompt processing. The image processing is in place but will need to be wrapped up API-wise.

davidkoski (Collaborator Author):

This is now the real code

Base automatically changed from v0.18.1 to main November 6, 2024 23:40
@davidkoski davidkoski force-pushed the vlm1 branch 2 times, most recently from e19f736 to 5ffe9b3 on November 19, 2024 16:15
@davidkoski davidkoski changed the title initial commit of vlm add VLM support, refactor common LM code into MLXLMCommon. breaking API changes Dec 4, 2024
@davidkoski davidkoski requested a review from awni December 4, 2024 19:16
import MLX
import MLXLLM
import MLXLMCommon
davidkoski (Collaborator Author):

See PR description -- split LLM -> LLM and LMCommon. Switched local names to match what people get via swiftpm (MLXLLM, etc.).

@@ -159,7 +160,7 @@ class LLMEvaluator {

/// This controls which model loads. `phi3_5_4bit` is one of the smaller ones, so this will fit on
/// more devices.
let modelConfiguration = ModelConfiguration.phi3_5_4bit
let modelConfiguration = ModelRegistry.phi3_5_4bit
davidkoski (Collaborator Author):

From the PR description:

- constants for models have moved from `ModelConfiguration` to `ModelRegistry`
- this is `MLXLLM.ModelRegistry` and there is also `MLXVLM.ModelRegistry`

```diff
-    let modelConfiguration = ModelConfiguration.phi3_5_4bit
+    let modelConfiguration = ModelRegistry.phi3_5_4bit
```

davidkoski (Collaborator Author):

This code is ready for review!

awni (Member) left a comment:

This is incredibly cool. I barely touched the surface but leaving a small review and going to try running it shortly.

structure something like this:

```swift
public class YourModel: Module, LLMModel, KVCacheDimensionProvider, LoRAModel {
```
awni (Member):

Btw I changed the KV cache implementation in mlx-lm to just init the keys and values the first time you call it. There is no need to initialize the KV cache with a head dim etc. so we could probably remove this interface as well. (Just a comment not something that we need to update in this PR)

davidkoski (Collaborator Author):

OK, I will take a look at it -- if it simplifies things it may be worth including here as we are already making some breaking changes.

davidkoski (Collaborator Author), Dec 9, 2024:

  • revisit KVCache / mlx-lm

Comment on lines 89 to 90
public let kvHeads: [Int]
public let headDim: IntOrPair
awni (Member):

And e.g. got rid of this which is not necessary

let (modelContainer, modelConfiguration) = try await memory.start(args.load)
let modelContainer = try await memory.start { [args] in
try await args.load(
defaultModel: "mlx-community/Mistral-7B-v0.1-hf-4bit-mlx",
awni (Member):

We should update this default model, it's pretty dated. Maybe mlx-community/Mistral-7B-Instruct-v0.3-4bit is a good option?

davidkoski (Collaborator Author), Dec 9, 2024:

Sure, I will give it a run and make sure it works!

  • test this

davidkoski (Collaborator Author):

It is one of the preset models, so good to go

@@ -203,29 +206,88 @@ struct EvaluateCommand: AsyncParsableCommand {

@MainActor
mutating func run() async throws {
let (modelContainer, modelConfiguration) = try await memory.start(args.load)
let modelContainer = try await memory.start { [args] in
awni (Member), Dec 9, 2024:

Can we rename this to LMCommand and subcommand lm, to match the VLMCommand?

Alternatively (given the complexity) it might be worth using the same subcommand and just dispatching to the vlm subroutine depending on whether an image input is provided or not.

davidkoski (Collaborator Author), Dec 9, 2024:

Interesting idea! The default model is different, as is the model factory. We could certainly switch on the presence of an image (or video) to choose, but I wonder if that complicates things over just having the two subcommands?

Let me try the refactor to fold these down into one and see if that looks reasonable.

  • try refactor of vlm -> eval (lm) command

awni (Member):

Yes it was a slightly off the cuff suggestion. It simplifies the command line but it might not be worth doing at the expense of code complexity.

davidkoski (Collaborator Author):

I think that worked well -- it came down to this (mostly):

```swift
// switch between LLM and VLM
let vlm = image.count > 0
if vlm {
    modelFactory = VLMModelFactory.shared
    defaultModel = MLXVLM.ModelRegistry.paligemma3bMix448_8bit
} else {
    modelFactory = LLMModelFactory.shared
    defaultModel = MLXLLM.ModelRegistry.mistral7B4bit
}
```
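A sketch of how the selected factory could then be used, assuming both factories share the `loadContainer(configuration:)` entry point described in the PR description (the `--model` argument handling is elided and `defaultModel` is used directly here for illustration):

```swift
// Either factory loads a ModelContainer the same way.
let modelContainer = try await modelFactory.loadContainer(configuration: defaultModel)
```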

Comment on lines 14 to 30
/// ```swift
/// let messages = [["role": "user", "content": prompt]]
/// let promptTokens = try await modelContainer.perform { context in
/// try context.tokenizer.applyChatTemplate(messages: messages)
/// }
/// ```
///
/// or:
///
/// ```swift
/// let result = await modelContainer.perform { context in
/// LLM.generate(
/// promptTokens: promptTokens, parameters: generateParameters, model: context.model,
/// tokenizer: context.tokenizer, extraEOSTokens: modelConfiguration.extraEOSTokens
/// ) { tokens in
/// ...
/// }
awni (Member):

Is this comment outdated?

davidkoski (Collaborator Author):

yes, thanks for spotting that!

Comment on lines +552 to +556
let inputEmbedding = languageModel.model.embedTokens(inputIds)
let (hiddenState, _, _) = self.visionModel(
pixelValues.transposed(0, 2, 3, 1).asType(inputEmbedding.dtype),
outputHiddenStates: true
)
awni (Member):

We have to be pretty careful with data types in these models cause it's really easy to upcast to fp32 by accident and that can slow things down a lot or use a lot more memory (or both).

One thing I recommend doing is if you have a test suite that runs the models, making sure the output type is the same as the input type.

Here you cast the pixelValues to the embedding type, which is good. But below you cast the output back to the pixelValues type, which I'm not sure about. I would just keep those in the same model type.

davidkoski (Collaborator Author):

Good spot on that!

inputEmbedding float16, hiddenState float32, pixelValues float32
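A sketch of the kind of dtype check suggested above for a model test (the helper and its use are illustrative, not the actual test suite):

```swift
import MLX
import XCTest

// Assert that running the model did not silently upcast to float32.
func assertSameDType(_ output: MLXArray, as input: MLXArray,
                     file: StaticString = #file, line: UInt = #line) {
    XCTAssertEqual(output.dtype, input.dtype, file: file, line: line)
}
```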

let embedDimension = imageFeatures.dim(2)
let (batchSize, sequenceLength) = inputIds.shape2
var scaledImageFeatures = imageFeatures / pow(Float(config.hiddenSize), 0.5)
var finalEmbedding = zeros([batchSize, sequenceLength, embedDimension])
awni (Member):

The default data type of zeros is fp32. That will cause anything that works with this finalEmbedding to be upcast to fp32.

davidkoski (Collaborator Author):

done

Comment on lines +614 to +615
let (inputEmbedding, finalAttentionMask4d) = inputEmbeddings(
inputIds: inputIds, pixelValues: image.pixels, mask: mask)
awni (Member):

We might want to cast the inputEmbedding to the LM dtype as well (get it from the embedding layer weight or something), just in case they have different types.

davidkoski (Collaborator Author):

handled inside the inputEmbeddings function:

```swift
private func inputEmbeddings(inputIds: MLXArray, pixelValues: MLXArray?, mask: MLXArray) -> (
    MLXArray, MLXArray
) {
    guard let pixelValues else {
        return (inputIds, mask)
    }

    let inputEmbedding = languageModel.model.embedTokens(inputIds)
    let (hiddenState, _, _) = self.visionModel(
        pixelValues.transposed(0, 2, 3, 1).asType(inputEmbedding.dtype),
```

imageMaskExpanded = repeated(imageMaskExpanded, count: embedDimension, axis: -1)
finalEmbedding = which(imageMaskExpanded, scaledImageFeatures, finalEmbedding)

finalEmbedding = which(padMaskExpanded, zeros(like: finalEmbedding), finalEmbedding)
awni (Member):

In python it's better to do `mx.where(mask, array, 0.0)`, since the 0 will be broadcast and inherit the type of array. I think the same is true in Swift?

davidkoski (Collaborator Author), Dec 10, 2024:

yes, to avoid the zeros float32 (and maybe faster to boot because of the broadcasting instead of a realized array). done
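A sketch of the resulting pattern (assuming `which` in mlx-swift accepts a scalar argument that broadcasts, as the exchange above suggests):

```swift
// the scalar 0 broadcasts against finalEmbedding and inherits its dtype,
// avoiding a separately realized zeros array
finalEmbedding = which(padMaskExpanded, 0, finalEmbedding)
```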


// insert image embeddings - the image mask is always less or equal to the sentence in length
var imageMaskExpanded = expandedDimensions(imageMask, axis: -1)
imageMaskExpanded = repeated(imageMaskExpanded, count: embedDimension, axis: -1)
awni (Member):

There is no need to explicitly repeat these. Just rely on the fact that which broadcasts its inputs against one another. The same is true for most of the calls to repeated above.

davidkoski (Collaborator Author):

wow, went from ~92 tokens / s -> 112 tokens / s


// insert padding and text token embeddings
finalEmbedding = which(textMaskExpanded, inputEmbedding, finalEmbedding)
finalEmbedding = which(padMaskExpanded, zeros(like: finalEmbedding), finalEmbedding)
awni (Member), Dec 9, 2024:

This zeros also should be a plain scalar and inherit the type of the finalEmbedding.

awni (Member) left a comment:

Massive! Thanks for adding this!

@davidkoski davidkoski merged commit 6ef303b into main Dec 10, 2024
1 check passed
@davidkoski davidkoski deleted the vlm1 branch December 10, 2024 19:00