
GitHub Action Pipeline Improvements #245

Merged
martindevans merged 2 commits into SciSharp:master from martindevans:fix/more_march_native on Nov 5, 2023

Conversation

martindevans
Member

Moved "common" defines (i.e. things that are the same on all platforms) into a single env var.

These common defines include -DLLAMA_NATIVE=OFF, which should fix the issue with AVX2 missing in builds where that flag wasn't previously defined (e.g. CUDA).

 - Added `COMMON_DEFINE` env var which will contain all common defines for all platforms (experimental)

lexxsoft commented Nov 4, 2023

@martindevans, I have checked out and built the martindevans:fix/more_march_native branch, then referenced:

  • \LLama\bin\Debug\net7.0\LLamaSharp.dll
  • \LLama\runtimes\libllama-cuda12.dll

I am getting:

The type initializer for 'LLama.Native.NativeApi' threw an exception.

Or do I also need to add \LLama\runtimes\libllama.dll?

If I do, it works, but I am not sure whether I am actually using CUDA 12, as Task Manager does not show any GPU load.

@martindevans
Member Author

You need to download the binaries from this run.

To install them in the project you need to overwrite the various files in LLamaSharp\LLama\runtimes:

  • deps.zip contains all the various CPU backends. Take the deps/AVX2 binaries from that zip and overwrite libllama.so and libllama.dll
  • deps.zip also contains the MacOS binaries. Take the files from deps/macos-metal and overwrite libllama.dylib and ggml-metal.metal
  • Download cu11.7.1.zip, rename those files and overwrite libllama-cuda11.dll/libllama-cuda11.so.
  • Download cu12.1.0.zip, rename those files and overwrite libllama-cuda12.dll/libllama-cuda12.so.

That should get you a completely up-to-date set of binaries.

There is one major caveat: there's not really any stability in the llama.cpp API from one version to the next. These binaries have been built with the latest version of llama.cpp, and there's no guarantee they'll be compatible with LLamaSharp. If you encounter errors due to that, it'll take a bit longer to update LLamaSharp to the new version.

@martindevans
Member Author

Note this should also help with #220, since it will add AVX2 to the Linux binaries as well. Hopefully the faster CI will be less flaky!

lexxsoft commented Nov 4, 2023

I am not sure if I am following correctly.

Since I am using the CUDA 12 backend, I don't need anything from deps.zip, but just to be safe I take the DLL from the AVX2 folder and overwrite the original dependency with it, though I am not referencing this file anywhere. Or should I?

Then I take cu12.1.0.zip and extract libllama.dll (there is no AVX2 variant), rename it to libllama-cuda12.dll and place it in the previous location.

Then I reference

  • \LLama\bin\Debug\net7.0\LLamaSharp.dll
  • \LLama\runtimes\libllama-cuda12.dll

But the result is the same:

The type initializer for 'LLama.Native.NativeApi' threw an exception.

@martindevans
Member Author

Since I am using the CUDA 12 backend, I don't need anything from deps.zip, but just to be safe I take the DLL from the AVX2 folder and overwrite the original dependency with it, though I am not referencing this file anywhere. Or should I?

If you're using the LLamaSharp project then it should have everything set up already to reference the DLLs where necessary. Since you're overwriting the existing DLLs you shouldn't need to change anything else.

You're right that you don't need anything from deps.zip, but I'd rather not mix up different backend versions even while testing. It's a recipe for confusion!

there is no AVX2 variant

At the moment there's only one CUDA build, and it uses AVX2 (as of this PR). In the future we might want to consider building all the variants for CUDA as well as all the variants for CPU, but the CUDA build is extremely slow, so we're not doing that at the moment.

Then I reference

I'm not quite sure what you mean here? If you've pulled my fork then you should just be able to run one of the LLamaSharp examples with no further changes (except setting n_gpu_layers of course).
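
For reference, setting n_gpu_layers from C# looks roughly like this. This is a minimal sketch, assuming the ModelParams type and GpuLayerCount property used by the LLamaSharp examples around this version, with a placeholder model path; check the examples in the branch for the exact API of your checkout:

```csharp
using LLama;
using LLama.Common;

// Minimal sketch (type/property names assumed from the LLamaSharp examples of
// this era; the model path is a placeholder). GpuLayerCount is the n_gpu_layers
// setting: 0 keeps everything on the CPU, higher values offload layers to CUDA.
var parameters = new ModelParams(@"C:\models\your-model.gguf")
{
    ContextSize = 2048,
    GpuLayerCount = 20
};

using var model = LLamaWeights.LoadFromFile(parameters);
```

If layers are actually being offloaded, you should see dedicated GPU memory in use (in Task Manager or nvidia-smi) as soon as the model loads, even before any generation happens.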

The type initializer for 'LLama.Native.NativeApi' threw an exception.

This will either be caused by the version compatibility issue I mentioned, or you've not got the binaries set up correctly so it can't find them.
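
A quick way to tell those two causes apart is to catch the exception and look at its inner exception. Rough sketch, under the same assumptions as the snippet above (placeholder model path, LLamaSharp API names as in the examples):

```csharp
using System;
using LLama;
using LLama.Common;

// Diagnostic sketch: the TypeInitializationException wraps the real failure, so
// the inner exception usually shows whether the native library could not be
// found (DllNotFoundException) or was found but doesn't match what LLamaSharp
// expects (e.g. EntryPointNotFoundException after a llama.cpp API change).
try
{
    var parameters = new ModelParams(@"C:\models\your-model.gguf"); // placeholder path
    using var model = LLamaWeights.LoadFromFile(parameters);        // first call into LLama.Native.NativeApi
}
catch (TypeInitializationException ex)
{
    Console.WriteLine(ex.InnerException);
}
```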

lexxsoft commented Nov 4, 2023

I'm not quite sure what you mean here? If you've pulled my fork then you should just be able to run one of the LLamaSharp examples with no further changes (except setting n_gpu_layers of course).

I was using those 2 dependencies in my own project, where I know what works and what does not.

OK, let's try it your way: I have updated the files as specified, then launched "TestRunner.cs" and chose option 4. I specified the same model as in my own code, and I am getting total gibberish:

llama_new_context_with_model: compute buffer total size = 96.63 MB
emia--------------------------fLA (d ? un'our and 1 and 39a pro, especially impress. matricesell with------------------------------------------------

Plus, I think I am using the CPU backend in the test; I am not sure how to switch to CUDA 12.

@martindevans
Member Author

Unfortunately, gibberish probably means there's some incompatible change in the llama.cpp API that I'll need to fix before you can test this. Hopefully I'll get time to do that this weekend.

hswlab (Contributor) commented Nov 4, 2023

I just tried the libllama.dll from the AVX folder of the current binaries. The session starts, but the chat bot is giving strange responses. When I ask "What is an apple?", it responds: "kwiet gegenüber then". I suspect it's the same problem that @lexxsoft was talking about :)

However, the response speed does not seem to have changed compared to the AVX binaries I used 4 days ago. Can someone explain whether the AI calculates the whole answer to a question first and then returns it word by word, or whether the answer is calculated word by word? Since there are such unusually long pauses between each word, one might think it is calculated word by word. That is also how I explain the gibberish that is sometimes returned: I assume the AI has lost its train of thought partway through its answer :)

Can you actually configure it so that answers are only returned after a complete sentence? Maybe that would be faster than getting an update at such short intervals?

@martindevans
Member Author

However, the response speed does not seem to have changed compared to the AVX binaries I used 4 days ago.

This PR has added AVX2 support to the Linux and CUDA binaries. So unless you're using one of those platforms with AVX2 (not just AVX), you won't see any difference!

Can someone explain whether the AI calculates the whole answer to a question first and then returns it word by word, or whether the answer is calculated word by word?

Language models never "calculate the answer" or really do any kind of thinking. They are always just picking the most probable next token in the sequence. They fundamentally work token by token, since you can't generate N+2 until you know what was picked for N+1.
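
To make that concrete, here is a toy, self-contained sketch. It deliberately does not use the LLamaSharp API (the "model" is just a fake scoring function); it only illustrates why output arrives one token at a time: each chosen token has to be appended to the sequence before the next one can be scored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for a language model: given the sequence so far, it returns a
// score for each of 4 possible tokens (here it simply prefers "last token + 1").
float[] FakeModel(IReadOnlyList<int> tokens)
{
    var logits = new float[4];
    logits[(tokens[tokens.Count - 1] + 1) % 4] = 1.0f;
    return logits;
}

var sequence = new List<int> { 0 };                 // the "prompt"
for (int step = 0; step < 6; step++)
{
    float[] logits = FakeModel(sequence);           // score candidates given everything generated so far
    int next = Array.IndexOf(logits, logits.Max()); // greedily pick the most probable token
    sequence.Add(next);                             // token N+1 must exist before N+2 can be scored
    Console.Write($"{next} ");                      // which is why output streams out token by token
}
```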

@AsakusaRinne
Collaborator

It seems that every time we compile in the pipeline, we clone the latest master branch. Shall we add a file containing the commit id so the build is pinned to a fixed version? Then, when we compile llama.cpp, we would read the commit id from the file first and check the repository out at that commit.

@martindevans
Member Author

It seems that every time we compile in the pipeline, we clone the latest master branch.

Yeah, that is something I've thought about fixing. At the moment the pipeline is usually only run manually when I specifically want updated binaries, but it would probably be handy to have some kind of override to specify the version.

I'd probably do it with a new input here which has a default value of master.

@martindevans
Member Author

I'm planning to integrate the new binaries into this PR tomorrow. As part of that, I'll fix whatever has broken due to the updated llama.cpp. Once that's done, testing should be as simple as pulling this branch and running the examples :)

@martindevans
Member Author

@SignalRT has already started a PR to update the binaries to a newer version. I'll merge this PR so we can use the action to generate binaries for #249.

martindevans merged commit ec20ab3 into SciSharp:master on Nov 5, 2023
6 checks passed
martindevans deleted the fix/more_march_native branch on November 5, 2023 at 15:59
@martindevans
Member Author

@lexxsoft any testing you can do over on that other PR would be much appreciated :)
