Conversation
Co-authored-by: Lukas Kreussel <[email protected]> Co-authored-by: Philpax <[email protected]>
Sorry for not engaging more with this earlier! Very cool - what's the current status of it, and how do you use it? My understanding is that the existing … (cc @LLukas22)
As far as I know the model loads and infers fine; we only need a more generic …
Hmm yeah, the … Do you have a test case for BERT already that we can use to test/do API design with?
Use cases for BERT models can be a bit iffy, as a lot of things are possible and each needs to be implemented accordingly. I would suggest simply exposing the logits somehow and implementing a … Since 99% of users will use these models only to generate embeddings, we should probably focus on that.
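To make the embeddings-first suggestion concrete, here is a minimal sketch of the mean pooling most users would apply on top of exposed per-token outputs. The function name and data shapes are illustrative assumptions, not `llm`'s actual API:

```rust
/// Mean-pools per-token hidden states into one sentence embedding.
/// Illustrative sketch: `hidden_states` holds one `hidden_dim`-length
/// vector per input token; `attention_mask` marks real tokens (1) vs.
/// padding (0). Names and shapes are assumptions, not the crate's API.
fn mean_pool(hidden_states: &[Vec<f32>], attention_mask: &[u8]) -> Vec<f32> {
    let hidden_dim = hidden_states.first().map_or(0, |t| t.len());
    let mut pooled = vec![0.0f32; hidden_dim];
    let mut count = 0usize;
    for (token, &mask) in hidden_states.iter().zip(attention_mask) {
        if mask == 1 {
            for (acc, &v) in pooled.iter_mut().zip(token) {
                *acc += v;
            }
            count += 1;
        }
    }
    // Average over the real (non-padding) tokens only.
    for v in &mut pooled {
        *v /= count.max(1) as f32;
    }
    pooled
}
```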
It's currently in a state of flux for my use case: I am looking to add batching support and to have this run properly on Metal. I agree that the current outputs of the inference model may not be perfect, but for now it suffices for embeddings alone (with the hacks done in this PR, such as disabling offloading for pooling). I do believe this is usable for embeddings in its current state: the embeddings example is on par with the embeddings produced by bert.cpp for the same model files.
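For reference, parity checks like the bert.cpp comparison above are usually expressed as the cosine similarity between the two implementations' embeddings for the same input; this standalone helper is an illustrative sketch, not code from this PR:

```rust
/// Cosine similarity between two embeddings of equal length.
/// Values near 1.0 mean the vectors point the same way, which is how
/// output parity with another implementation is typically judged.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```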
@nerdypepper Looks like they are working on proper matrix x matrix support for Metal: ggerganov/llama.cpp#2615
I am interested in using BERT for encodings; is there any work I could do on this pull request to get it into a state where it can be merged? Thank you!
Hm, that's a good question. @nerdypepper I'll try to get #428 across the line, but is that a blocker for merging this? |
@philpax I believe it is; inference on Metal does not work without the latest kernels from llama.cpp. That being said, I have no idea how BERT performs against the branch on #428; this will require some testing. I would be happy to pick this up over the weekend. This PR needs some patches from this branch on my fork. Side note: GPU inference of any form mostly only benefits from batching, for which there is currently no interface in …
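For illustration, a batched interface might look something like the sketch below. The `EmbedBatch` trait and its methods are hypothetical and not part of `llm`; the point is that a GPU backend wants all sequences in one forward pass so it can run matrix x matrix kernels instead of many matrix x vector ones:

```rust
/// Hypothetical sketch of a batched embedding interface (not `llm`'s API).
trait EmbedBatch {
    /// Embeds a single input sequence.
    fn embed(&self, input: &str) -> Vec<f32>;

    /// Embeds several inputs at once. A real GPU backend would fuse these
    /// into one forward pass; this default just loops one at a time as a
    /// correctness baseline.
    fn embed_batch(&self, inputs: &[&str]) -> Vec<Vec<f32>> {
        inputs.iter().map(|&s| self.embed(s)).collect()
    }
}
```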