llama_decode lock (SciSharp#595)
* Added a lock object to `SafeLlamaModelHandle` which all calls to `llama_decode` (in the `SafeLLamaContextHandle`) acquire first. This prevents two contexts from running inference on the same model at the same time, which appears to be unsafe in llama.cpp.

* Modified the lock to be global across _all_ inferences. This seems to be necessary (at least with the CUDA backend); a minimal sketch of the resulting pattern is shown below.
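
A minimal, self-contained sketch of the pattern this commit introduces: a single process-wide lock that serializes every call into a native routine which is not thread-safe. The `InferenceGate` and `NativeDecode` names below are hypothetical stand-ins for `SafeLLamaContextHandle.Decode` and `llama_decode`; only the locking idea itself comes from the commit.

// NOTE: hypothetical illustration, not part of the commit. "InferenceGate" and
// "NativeDecode" stand in for SafeLLamaContextHandle.Decode and llama_decode.
using System;
using System.Threading.Tasks;

static class InferenceGate
{
    // One lock shared by every context and model, mirroring GlobalInferenceLock.
    private static readonly object Gate = new();

    public static int Decode(int contextId)
    {
        // Serialize all native decode calls: only one thread may be inside
        // the (thread-unsafe) native routine at any moment.
        lock (Gate)
        {
            return NativeDecode(contextId);
        }
    }

    // Placeholder for the unsafe native call (llama_decode in the real code).
    private static int NativeDecode(int contextId)
    {
        Console.WriteLine($"decoding on context {contextId}");
        return 0;
    }
}

static class Demo
{
    static void Main()
    {
        // Two "contexts" decode concurrently; the gate guarantees the native
        // calls never overlap, even though the tasks run on separate threads.
        Task.WaitAll(
            Task.Run(() => InferenceGate.Decode(1)),
            Task.Run(() => InferenceGate.Decode(2)));
    }
}

Because the lock is a static field it is shared by the whole process, which is why this version also serializes inference across different models, not only across contexts that share one model.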
martindevans authored Mar 13, 2024
1 parent 9deb50a commit ce4de7d
Showing 1 changed file with 15 additions and 2 deletions.
17 changes: 15 additions & 2 deletions LLama/Native/SafeLLamaContextHandle.cs
@@ -192,6 +192,18 @@ public uint TokenToSpan(LLamaToken token, Span<byte> dest)
     #endregion
 
     #region infer
+    /// <summary>
+    /// This object exists to ensure there is only ever 1 inference running at a time. This is a workaround for thread safety issues in llama.cpp itself.
+    /// Most notably CUDA, which seems to use some global singleton resources and will crash if multiple inferences are run (even against different models).
+    ///
+    /// For more information see these issues:
+    /// - https://github.com/SciSharp/LLamaSharp/issues/596
+    /// - https://github.com/ggerganov/llama.cpp/issues/3960
+    ///
+    /// If these are ever resolved this lock can probably be removed.
+    /// </summary>
+    private static readonly object GlobalInferenceLock = new();
+
     /// <summary>
     /// </summary>
     /// <param name="batch"></param>
@@ -202,8 +214,9 @@ public uint TokenToSpan(LLamaToken token, Span<byte> dest)
     /// </returns>
     public DecodeResult Decode(LLamaBatch batch)
     {
-        using (batch.ToNativeBatch(out var nb))
-            return (DecodeResult)NativeApi.llama_decode(this, nb);
+        lock (GlobalInferenceLock)
+        using (batch.ToNativeBatch(out var nb))
+            return (DecodeResult)NativeApi.llama_decode(this, nb);
     }
 
     /// <summary>
