
Conversation

@kashif (Contributor) commented Oct 4, 2025

fixes #238

@pcuenca (Member) left a comment

Thanks @kashif! 🙌

I have some questions about the potentially confusing use of configuration and tokenizer. Could we maybe make the behaviour more explicit, or perhaps defer configuration to a later PR if it's not essential?

```swift
var repetitionPenalty: Float?

@Option(help: "Path to a local folder containing tokenizer_config.json and tokenizer.json")
var tokenizerFolder: String?
```
@pcuenca (Member):

Suggested change:

```diff
- var tokenizerFolder: String?
+ var tokenizerPath: String?
```

(nit: this is perhaps more idiomatic in Swift APIs)


```swift
let compiledURL: URL = ... // path to .mlmodelc
let tokenizerFolder: URL = ... // folder containing tokenizer_config.json and tokenizer.json
```
@pcuenca (Member):

Suggested change:

```diff
- let tokenizerFolder: URL = ... // folder containing tokenizer_config.json and tokenizer.json
+ let tokenizerURL: URL = ... // folder containing tokenizer_config.json and tokenizer.json
```


Make sure the tokenizer assets come from the same Hugging Face repo as the original checkpoint. For the
@pcuenca (Member):

Suggested change:

```diff
- Make sure the tokenizer assets come from the same Hugging Face repo as the original checkpoint. For the
+ Make sure the tokenizer assets come from the same Hugging Face repo as the original checkpoint or are compatible with the model you use. For the
```

Comment on lines +49 to +55
```swift
if let configuration {
    self.configuration = configuration
} else if tokenizer == nil {
    self.configuration = LanguageModelConfigurationFromHub(modelName: modelName)
} else {
    self.configuration = nil
}
```
@pcuenca (Member):

I find it a bit confusing that if configuration is provided, then tokenizer will be silently ignored. These look like two different ways to inject a tokenizer. Could we maybe use multiple initializers instead?

Another option is to just remove the configuration argument for now and discuss in a new PR. Is the main reason to add it to provide a custom HubApi? That's useful, of course, but perhaps we could just provide that instead of the full configuration.
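
For illustration only, here is a minimal sketch of the separate-initializer idea using a toy type. `ToyLanguageModel`, `Tokenizing`, and the `configurationSource` string are made up for this example and are not the PR's actual API:

```swift
import Foundation

// Toy stand-in for a tokenizer; the real project uses swift-transformers' Tokenizer.
protocol Tokenizing {
    func encode(text: String) -> [Int]
}

// Hypothetical model wrapper with one initializer per injection path,
// so neither argument can silently override the other.
final class ToyLanguageModel {
    let modelName: String
    private(set) var tokenizer: Tokenizing?
    private(set) var configurationSource: String?

    // Path 1: the caller injects a ready-made tokenizer; no hub configuration is loaded.
    init(modelName: String, tokenizer: Tokenizing) {
        self.modelName = modelName
        self.tokenizer = tokenizer
        self.configurationSource = nil
    }

    // Path 2: no tokenizer is provided; a configuration is resolved (e.g. from the Hub)
    // and the tokenizer is built from it later.
    init(modelName: String) {
        self.modelName = modelName
        self.tokenizer = nil
        self.configurationSource = "hub:\(modelName)"
    }
}
```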

Comment on lines +194 to +195
```swift
tokenizerFolder: URL,
computeUnits: MLComputeUnits = .cpuAndGPU
```
@pcuenca (Member):

Suggested change:

```diff
- tokenizerFolder: URL,
- computeUnits: MLComputeUnits = .cpuAndGPU
+ computeUnits: MLComputeUnits = .cpuAndGPU,
+ tokenizer tokenizerFolder: URL,
```

This makes it look like an overloaded version of the previous method (keeping the same parameter order and overloading the type of the tokenizer argument, while still using tokenizerFolder internally). This is usual in Swift APIs (although perhaps the tokenizer name could be somewhat misleading).
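
A rough sketch of that overloading convention, with placeholder names (`ToyLoader`, `ToyTokenizer`, and `loadCompiled` are made up here; only `MLComputeUnits` is the real CoreML type):

```swift
import CoreML

// Placeholder standing in for swift-transformers' Tokenizer protocol.
protocol ToyTokenizer {}

enum ToyLoader {
    // Overload 1: the caller passes a folder URL. The external label stays `tokenizer`
    // while the internal name remains `tokenizerFolder`, and the parameter order
    // matches the other overload.
    static func loadCompiled(
        url: URL,
        computeUnits: MLComputeUnits = .cpuAndGPU,
        tokenizer tokenizerFolder: URL
    ) {
        print("Would load tokenizer files from \(tokenizerFolder.path)")
    }

    // Overload 2: the caller passes an already-constructed tokenizer instance.
    static func loadCompiled(
        url: URL,
        computeUnits: MLComputeUnits = .cpuAndGPU,
        tokenizer: ToyTokenizer
    ) {
        print("Would use the provided tokenizer instance")
    }
}
```

Because `computeUnits` has a default value, callers can omit it in both overloads and the compiler picks the right one from the type of the `tokenizer` argument.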

Comment on lines +207 to +208
```swift
tokenizer: Tokenizer,
computeUnits: MLComputeUnits = .cpuAndGPU
```
@pcuenca (Member):

Suggested change:

```diff
- tokenizer: Tokenizer,
- computeUnits: MLComputeUnits = .cpuAndGPU
+ computeUnits: MLComputeUnits = .cpuAndGPU,
+ tokenizer: Tokenizer,
```

@kashif (Contributor, Author) commented Oct 8, 2025

thanks! will look shortly and fix

Development

Successfully merging this pull request may close these issues.

Passing Tokenizer to LanguageModel