lets callers inject a prebuilt Tokenizer in the LanguageModel #278

kashif · 2025-10-04T11:37:08Z

fixes #238

pcuenca

Thanks @kashif! 🙌

I have some questions about the potentially confusing use of configuration and tokenizer. Could we make the behaviour more explicit, or perhaps defer configuration to a later PR if it's not essential?

pcuenca · 2025-10-08T09:33:48Z

Examples/transformers-cli/Sources/transformers-cli/Transformers.swift

    var repetitionPenalty: Float?

+    @Option(help: "Path to a local folder containing tokenizer_config.json and tokenizer.json")
+    var tokenizerFolder: String?


Suggested change

-    var tokenizerFolder: String?
+    var tokenizerPath: String?

(nit: this is perhaps more idiomatic in Swift APIs)

pcuenca · 2025-10-08T09:35:03Z

README.md

+
+```swift
+let compiledURL: URL = ... // path to .mlmodelc
+let tokenizerFolder: URL = ... // folder containing tokenizer_config.json and tokenizer.json


Suggested change

-let tokenizerFolder: URL = ... // folder containing tokenizer_config.json and tokenizer.json
+let tokenizerURL: URL = ... // folder containing tokenizer_config.json and tokenizer.json

pcuenca · 2025-10-08T09:35:47Z

README.md

+)
+```
+
+Make sure the tokenizer assets come from the same Hugging Face repo as the original checkpoint. For the


Suggested change

-Make sure the tokenizer assets come from the same Hugging Face repo as the original checkpoint. For the
+Make sure the tokenizer assets come from the same Hugging Face repo as the original checkpoint or are compatible with the model you use. For the

pcuenca · 2025-10-08T09:40:51Z

Sources/Models/LanguageModel.swift

+        if let configuration {
+            self.configuration = configuration
+        } else if tokenizer == nil {
+            self.configuration = LanguageModelConfigurationFromHub(modelName: modelName)
+        } else {
+            self.configuration = nil
+        }


I find it a bit confusing that if configuration is provided, then tokenizer will be silently ignored. These look like two different ways to inject a tokenizer. Could we maybe use multiple initializers instead?

Another option is to just remove the configuration argument for now and discuss in a new PR. Is the main reason to add it to provide a custom HubApi? That's useful, of course, but perhaps we could just provide that instead of the full configuration.
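The "multiple initializers" idea can be sketched as follows. This is only an illustration under stated assumptions, not the PR's code: `SketchTokenizer`, `WhitespaceTokenizer`, `TokenizerSource`, and `SketchLanguageModel` are hypothetical stand-ins, not the library's real `Tokenizer` or `LanguageModel` types. The point is that each initializer accepts exactly one way of obtaining a tokenizer, so no argument can be silently ignored.

```swift
// Hypothetical stand-in for the library's tokenizer protocol.
protocol SketchTokenizer {
    func encode(_ text: String) -> [Int]
}

// Toy tokenizer: one id per space-separated word (its character count).
struct WhitespaceTokenizer: SketchTokenizer {
    func encode(_ text: String) -> [Int] {
        text.split(separator: " ").map { $0.count }
    }
}

// Records where the tokenizer comes from, so the choice is explicit.
enum TokenizerSource {
    case injected(any SketchTokenizer)
    case hub(modelName: String)
}

// Hypothetical stand-in for LanguageModel.
struct SketchLanguageModel {
    let modelName: String
    let tokenizerSource: TokenizerSource

    // Path 1: the caller supplies a prebuilt tokenizer; nothing is fetched.
    init(modelName: String, tokenizer: any SketchTokenizer) {
        self.modelName = modelName
        self.tokenizerSource = .injected(tokenizer)
    }

    // Path 2: no tokenizer supplied; it would be resolved from the Hub by name.
    init(modelName: String) {
        self.modelName = modelName
        self.tokenizerSource = .hub(modelName: modelName)
    }
}

// Usage: the call site shows which path is taken.
let injected = SketchLanguageModel(modelName: "toy-model", tokenizer: WhitespaceTokenizer())
let fromHub = SketchLanguageModel(modelName: "toy-model")
```

With this shape there is no combination of arguments in which one tokenizer source quietly overrides another; a third initializer taking a custom Hub client could be added later without touching the first two.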

pcuenca · 2025-10-08T09:42:56Z

Sources/Models/LanguageModel.swift

+        tokenizerFolder: URL,
+        computeUnits: MLComputeUnits = .cpuAndGPU


Suggested change

-        tokenizerFolder: URL,
-        computeUnits: MLComputeUnits = .cpuAndGPU
+        computeUnits: MLComputeUnits = .cpuAndGPU,
+        tokenizer tokenizerFolder: URL,

This makes it look like an overloaded version of the previous method: it keeps the same argument order and overloads the type for tokenizer, while still using tokenizerFolder inside. This is usual in Swift APIs (although perhaps the tokenizer name could be somewhat misleading).
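A minimal sketch of the overloading pattern described in this comment, assuming hypothetical names throughout (`ToyTokenizer`, `ToyComputeUnits`, `ToyLoader`, and `describeLoad` are illustrations, not part of the library). Both functions share the external label `tokenizer`; the second keeps `tokenizerFolder` as its internal parameter name, and Swift picks the overload from the argument's type.

```swift
import Foundation

// Hypothetical stand-ins for Tokenizer and MLComputeUnits.
struct ToyTokenizer { let vocabularyName: String }
enum ToyComputeUnits { case cpuOnly, cpuAndGPU }

enum ToyLoader {
    // Overload 1: the caller passes a prebuilt tokenizer.
    static func describeLoad(
        model: URL,
        computeUnits: ToyComputeUnits = .cpuAndGPU,
        tokenizer: ToyTokenizer
    ) -> String {
        "prebuilt:\(tokenizer.vocabularyName)"
    }

    // Overload 2: same external label `tokenizer`, but the argument is a
    // folder URL, referred to as `tokenizerFolder` inside the function.
    static func describeLoad(
        model: URL,
        computeUnits: ToyComputeUnits = .cpuAndGPU,
        tokenizer tokenizerFolder: URL
    ) -> String {
        "folder:\(tokenizerFolder.lastPathComponent)"
    }
}

// Usage: identical call shape; the argument type selects the overload.
let modelURL = URL(fileURLWithPath: "/tmp/Model.mlmodelc")
let viaTokenizer = ToyLoader.describeLoad(model: modelURL, tokenizer: ToyTokenizer(vocabularyName: "toy"))
let viaFolder = ToyLoader.describeLoad(model: modelURL, tokenizer: URL(fileURLWithPath: "/tmp/tokenizer"))
```

The trade-off the comment hints at is visible here: call sites stay uniform, but a reader seeing `tokenizer:` followed by a URL has to know that a folder is expected.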

pcuenca · 2025-10-08T09:43:35Z

Sources/Models/LanguageModel.swift

+        tokenizer: Tokenizer,
+        computeUnits: MLComputeUnits = .cpuAndGPU


Suggested change

-        tokenizer: Tokenizer,
-        computeUnits: MLComputeUnits = .cpuAndGPU
+        computeUnits: MLComputeUnits = .cpuAndGPU,
+        tokenizer: Tokenizer,

kashif · 2025-10-08T11:27:01Z

thanks! will look shortly and fix

kashif added 2 commits October 4, 2025 13:36

lets callers inject a prebuilt Tokenizer

c4ba14d

add a tokenizerFolder argument

0d0b00a

pcuenca reviewed Oct 8, 2025
