GitHub Action Pipeline Improvements #245
Conversation
- Added `COMMON_DEFINE` env var which will contain all common defines for all platforms (experimental)
@martindevans, I have checked out and built the martindevans:fix/more_march_native branch, then referenced
Or do I also need to add
If I do, it works, but I am not sure whether I am using CUDA12, as Task Manager does not show any GPU load.
You need to download the binaries from this run. To install them in the project you need to overwrite the various files in
That should get you a completely up to date set of binaries. There is one major caveat: there's not really any stability in the llama.cpp API from one version to the next. These binaries have been built with the latest version of llama.cpp and there's no guarantee they'll be compatible with LLamaSharp. If you encounter errors due to that it'll take a bit longer to update LLamaSharp to the new version.
Note this should also help with #220 since it will add AVX2 to the Linux binaries as well. Hopefully faster CI will be less flaky!
I am not sure if I am following correctly. Since I am using the Cuda12 backend, I don't need anything from
Then I take
Then I reference
But the result is the same:
If you're using the LLamaSharp project then it should already have everything set up to reference the DLLs where necessary. Since you're overwriting the existing DLLs you shouldn't need to change anything else. You're right that you don't need anything from the deps.zip folder, but I'd rather not mix up different backend versions even while testing. It's a recipe for confusion!
At the moment there's only one CUDA build and it uses AVX2 (as of this PR). In the future we might want to consider building all the variants for CUDA as well as all the variants for CPU, but the CUDA build is extremely slow, so the pipeline isn't doing that at the moment.
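For illustration only, a build matrix along those lines might look roughly like the sketch below. This is not the actual workflow in this PR: the job name and variant list are hypothetical, and the `LLAMA_*`/`BUILD_SHARED_LIBS` defines are the cmake option names llama.cpp used around that time.

```yaml
jobs:
  compile:                    # hypothetical job name
    strategy:
      fail-fast: false
      matrix:
        include:
          # CPU feature-level variants (illustrative set)
          - { name: noavx,  defines: "-DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF" }
          - { name: avx,    defines: "-DLLAMA_AVX2=OFF" }
          - { name: avx2,   defines: "" }
          - { name: avx512, defines: "-DLLAMA_AVX512=ON" }
          # In principle CUDA could be crossed with the same CPU variants,
          # but CUDA builds are very slow, hence only one variant today.
          # (CUDA toolkit installation omitted from this sketch.)
          - { name: cuda12-avx2, defines: "-DLLAMA_CUBLAS=ON" }
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          repository: ggerganov/llama.cpp
      - name: Build
        run: |
          mkdir build && cd build
          cmake .. -DBUILD_SHARED_LIBS=ON ${{ matrix.defines }}
          cmake --build . --config Release
```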
I'm not quite sure what you mean here? If you've pulled my fork then you should just be able to run one of the LLamaSharp examples with no further changes (except setting
This will either be caused by the version compatibility issue I mentioned, or the binaries aren't set up correctly so it can't find them.
I was using those two dependencies in my own project, where I know what works and what does not. OK, let's try your way: I have updated the files as specified, then launched "TestRunner.cs" and chose option 4. I specified the same model as in my own code, and I am getting total gibberish:
Plus I think I am using
Unfortunately gibberish probably means there's some incompatible change in the llama.cpp API that I'll need to fix before you can test this. Hopefully I'll get time to do that this weekend.
I just tried the libllama.dll from the avx folder of the current binaries. The session starts, but the chat bot gives strange responses. When I ask "What is an apple?", it responds: "kwiet gegenüber then". I suspect it's the same problem that @lexxsoft was talking about :) However, the response speed does not seem to have changed compared to the avx binaries I used 4 days ago.

Can someone explain whether the AI calculates the answer to a question first and then returns it word by word, or whether the answer is calculated word by word? Since there are such unusually long pauses between each word, one might think it is calculated word by word. That is also how I explain the gibberish that is sometimes returned; I think the AI loses its train of thought partway through its answer :)

Can you actually configure it so that answers are only returned after a complete sentence? Maybe that would be faster than getting an update at such short intervals?
This PR has added AVX2 support to the Linux and CUDA binaries. So unless you're using one of those platforms with a CPU that supports AVX2 (not just AVX) you won't see any difference!
Language models never "calculate the answer" or really do any kind of thinking. They are always just picking the most probable next token in the sequence. They fundamentally work token by token, since you can't generate token N+2 until you know what was picked for token N+1.
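Put differently, generation is autoregressive: the probability of a whole sequence factorises token by token (this is the general formulation, not anything specific to llama.cpp or LLamaSharp):

$$P(x_1,\dots,x_T) = \prod_{t=1}^{T} P(x_t \mid x_1,\dots,x_{t-1})$$

Sampling $x_{t+1}$ requires the already-chosen $x_1,\dots,x_t$, so output necessarily arrives one token at a time; batching the display into whole sentences would only change when text is shown, not how fast it is produced.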
It seems that every time we compile in the pipeline, we clone the latest master branch. Shall we add a file containing the commit id, to pin the version used by the pipeline? Then, when we compile llama.cpp, we read the commit id from the file first and check out the repository at that commit.
Yeah, that is something I've thought about fixing. At the moment the pipeline is usually only run manually when I specifically want updated binaries, but it would probably be handy to have some kind of override to specify the version. I'd probably do it with a new input here which has a default value of
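One possible shape for that override, sketched with a hypothetical input name and default rather than the actual values from the workflow:

```yaml
on:
  workflow_dispatch:
    inputs:
      llama_cpp_ref:
        description: "llama.cpp commit/tag to build (hypothetical input name)"
        required: false
        default: "master"   # hypothetical default; a pinned SHA would also work

jobs:
  compile:
    runs-on: ubuntu-latest
    steps:
      # Check out llama.cpp at the requested ref instead of whatever
      # master happens to be when the pipeline runs.
      - uses: actions/checkout@v3
        with:
          repository: ggerganov/llama.cpp
          ref: ${{ github.event.inputs.llama_cpp_ref }}
```

Manual runs would then default to the latest code, while passing a specific commit id would reproduce an earlier set of binaries.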
I'm planning to integrate the new binaries into this PR tomorrow. As part of that I'll fix whatever has broken due to the updated llama.cpp. Once that's done, testing should be as simple as pulling this branch and running the examples :)
@lexxsoft any testing you can do over on that other PR would be much appreciated :)
Moved "common" defines (i.e. things that are the same on all platforms) into a single env var. These common defines include
`-DLLAMA_NATIVE=OFF`
which should fix the issue with AVX2 missing in builds where that isn't defined (e.g. CUDA).
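Conceptually the change amounts to something like the fragment below. Only `-DLLAMA_NATIVE=OFF` is taken from the description above; the rest of the env var contents, the job name and the platform flag are illustrative assumptions, not the exact workflow.

```yaml
env:
  # Shared by every platform. -DLLAMA_NATIVE=OFF disables -march=native so the
  # instruction-set flags come from the explicit defines rather than the host CPU.
  COMMON_DEFINE: "-DLLAMA_NATIVE=OFF -DBUILD_SHARED_LIBS=ON"   # contents illustrative

jobs:
  compile-cuda:                 # illustrative job name; checkout steps omitted
    runs-on: windows-latest
    steps:
      - name: Build
        run: |
          # The common defines are appended next to the platform-specific ones,
          # so the CUDA build no longer silently drops them.
          cmake .. ${{ env.COMMON_DEFINE }} -DLLAMA_CUBLAS=ON
          cmake --build . --config Release
```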