Merge pull request #1398 from karthik2804/update_llm_docs
update folder structure for llm in spin 3.0
itowlson authored Oct 20, 2024
2 parents 92900d7 + ec9dfcb commit c42b27a
Showing 4 changed files with 24 additions and 31 deletions.
2 changes: 2 additions & 0 deletions content/spin/v1/ai-sentiment-analysis-api-tutorial.md
@@ -32,6 +32,8 @@
- [Conclusion](#conclusion)
- [Next Steps](#next-steps)

> This tutorial does not work with Spin `v3.0` or above, as the on-disk representation of the models has changed. If you are on Spin `v3.0` or above, refer to the [V3 tutorial](/spin/v3/ai-sentiment-analysis-api-tutorial).

Artificial Intelligence (AI) Inferencing performs well on GPUs. However, GPU infrastructure is both scarce and expensive. This tutorial will show you how to use Fermyon Serverless AI to quickly build advanced AI-enabled serverless applications that can run on Fermyon Cloud. Your applications will benefit from 50 millisecond cold start times and operate 100x faster than other on-demand AI infrastructure services. Take a quick look at the video below to learn about executing inferencing on LLMs with no extra setup.

<iframe width="854" height="480" src="https://www.youtube.com/embed/01oOh3D9cVQ?si=wORKmuOkeFMGYBsQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
2 changes: 2 additions & 0 deletions content/spin/v2/ai-sentiment-analysis-api-tutorial.md
@@ -31,6 +31,8 @@
- [Conclusion](#conclusion)
- [Next Steps](#next-steps)

> This tutorial does not work with Spin `v3.0` or above, as the on-disk representation of the models has changed. If you are on Spin `v3.0` or above, refer to the [V3 tutorial](/spin/v3/ai-sentiment-analysis-api-tutorial).

Artificial Intelligence (AI) Inferencing performs well on GPUs. However, GPU infrastructure is both scarce and expensive. This tutorial will show you how to use Fermyon Serverless AI to quickly build advanced AI-enabled serverless applications that can run on Fermyon Cloud. Your applications will benefit from 50 millisecond cold start times and operate 100x faster than other on-demand AI infrastructure services. Take a quick look at the video below to learn about executing inferencing on LLMs with no extra setup.

<iframe width="854" height="480" src="https://www.youtube.com/embed/01oOh3D9cVQ?si=wORKmuOkeFMGYBsQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
33 changes: 9 additions & 24 deletions content/spin/v3/ai-sentiment-analysis-api-tutorial.md
@@ -13,7 +13,6 @@
- [Serverless AI Inferencing With Spin Applications](#serverless-ai-inferencing-with-spin-applications)
- [Creating a New Spin Application](#creating-a-new-spin-application)
- [Supported AI Models](#supported-ai-models)
- [Model Optimization](#model-optimization)
- [Application Structure](#application-structure)
- [Application Configuration](#application-configuration)
- [Source Code](#source-code)
@@ -45,7 +44,7 @@ In this tutorial we will:

### Spin

You will need to [install the latest version of Spin](install#installing-spin). Serverless AI is supported on Spin versions 1.5 and above.
You will need to [install the latest version of Spin](install#installing-spin). This tutorial requires Spin 3.0 or greater.

If you already have Spin installed, [check what version you are on and upgrade](upgrade#are-you-on-the-latest-version) if required.
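
To double-check, a quick version query confirms what is installed:

<!-- @selectiveCpy -->

```bash
# Print the installed Spin version; this tutorial assumes 3.0 or greater.
$ spin --version
```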

@@ -173,10 +172,6 @@ Fermyon's Spin and Serverless AI currently support:
- Meta's open source Large Language Models (LLMs) [Llama](https://ai.meta.com/llama/), specifically the `llama2-chat` and `codellama-instruct` models (see Meta [Licenses](#licenses) section above).
- SentenceTransformers' [embeddings](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) models, specifically the `all-minilm-l6-v2` model.

### Model Optimization

The models need to be in a particular format for Spin to be able to use them (quantized, which is a form of optimization). The official download links for the models (in non-quantized format) are listed in the previous section. However, for your convenience, the code examples below fetch models which are already in the special quantized format.

### Application Structure

Next, we need to create the appropriate folder structure from within the application directory (alongside our `spin.toml` file). The code below demonstrates the variations in folder structure depending on which model is being used. Once the folder structure is in place, we then fetch the pre-trained AI model for our application:
@@ -187,15 +182,7 @@

> Ensure you have read the Meta [Licenses](#licenses) section before continuing to use Llama models.
<!-- @selectiveCpy -->

```bash
# llama2-chat
$ mkdir -p .spin/ai-models/llama
$ cd .spin/ai-models/llama
$ wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/a17885f653039bd07ed0f8ff4ecc373abf5425fd/llama-2-13b-chat.ggmlv3.q3_K_L.bin
$ mv llama-2-13b-chat.ggmlv3.q3_K_L.bin llama2-chat
```
Download the `*.safetensors`, `config.json`, and `tokenizer.json` files from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-hf) and place them in the structure shown below. The `.spin` directory must be placed at the root of the Spin project.

<!-- @nocpy -->

@@ -205,21 +192,16 @@

```bash
tree .spin
.spin
└── ai-models
    └── llama
        └── llama2-chat
            ├── <*.safetensors files>
            ├── config.json
            └── tokenizer.json
```
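
One way to fetch those files is sketched below, assuming the `huggingface-cli` tool (from the `huggingface_hub` Python package) is installed and that Meta has granted you access to the gated repository; any download method that produces the layout above works just as well:

<!-- @selectiveCpy -->

```bash
# A sketch, not the only method: pull just the files Spin needs straight
# into the expected directory. Assumes huggingface-cli is installed
# (pip install huggingface_hub) and that you have access to the gated repo.
$ huggingface-cli download meta-llama/Llama-2-7b-hf \
    --include "*.safetensors" "config.json" "tokenizer.json" \
    --local-dir .spin/ai-models/llama/llama2-chat
```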

**codellama-instruct example download**

> Ensure you have read the Meta [Licenses](#licenses) section before continuing to use Llama models.
<!-- @selectiveCpy -->

```bash
# codellama-instruct
$ mkdir -p .spin/ai-models/llama
$ cd .spin/ai-models/llama
$ wget https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGML/resolve/b3dc9d8df8b4143ee18407169f09bc12c0ae09ef/codellama-13b-instruct.ggmlv3.Q3_K_L.bin
$ mv codellama-13b-instruct.ggmlv3.Q3_K_L.bin codellama-instruct
```
Download the `*.safetensors`, `config.json`, and `tokenizer.json` files from [Hugging Face](https://huggingface.co/meta-llama/CodeLlama-7b-hf/tree/main) and place them in the structure shown below.

<!-- @nocpy -->

@@ -229,6 +211,9 @@

```bash
tree .spin
.spin
└── ai-models
    └── llama
        └── codellama-instruct
            ├── <*.safetensors files>
            ├── config.json
            └── tokenizer.json
```
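
The same sketch applies to `codellama-instruct`, pointed at the CodeLlama repository instead (the same assumptions about `huggingface-cli` and gated access apply):

<!-- @selectiveCpy -->

```bash
# Mirror of the llama2-chat sketch above, targeting codellama-instruct.
$ huggingface-cli download meta-llama/CodeLlama-7b-hf \
    --include "*.safetensors" "config.json" "tokenizer.json" \
    --local-dir .spin/ai-models/llama/codellama-instruct
```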

**all-minilm-l6-v2 example download**
18 changes: 11 additions & 7 deletions content/spin/v3/serverless-ai-api-guide.md
@@ -32,19 +32,23 @@ ai_models = ["codellama-instruct"]
// -- snip --
```

> Spin supports "llama2-chat" and "codellama-instruct" for inferencing and "all-minilm-l6-v2" for generating embeddings.
> Spin supports models of the Llama architecture for inferencing and "all-minilm-l6-v2" for generating embeddings.

### File Structure

By default, the Spin framework will expect any already trained model files (which are configured as per the previous section) to be downloaded by the user and made available inside a `.spin/ai-models/` file path of a given application. For example:
By default, the Spin framework expects any already trained model files (configured as per the previous section) to be downloaded by the user and made available inside the `.spin/ai-models/` directory of a given application.
Within the `.spin/ai-models` directory, models of the same architecture (e.g. `llama`) must be grouped under a directory with the same name as the architecture.
Within an architecture directory, each individual model (e.g. `llama2-chat`, `codellama-instruct`) must be placed under a folder with the same name as the model. So, for any given model, the files for that model are placed in the directory `.spin/ai-models/<architecture>/<model>`. For example:

```bash
code-generator-rs/.spin/ai-models/llama/codellama-instruct
code-generator-rs/.spin/ai-models/llama/codellama-instruct/safetensors
code-generator-rs/.spin/ai-models/llama/codellama-instruct/config.json
code-generator-rs/.spin/ai-models/llama/codellama-instruct/tokenizer.json
```
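
As a quick sanity check, the expected directory can be created and inspected from the application root; a small sketch (the `code-generator-rs` name above is just an example application):

```bash
# Create the directory Spin will search for the "codellama-instruct" model,
# then inspect the layout; the model files themselves still need to be
# downloaded into this directory.
$ mkdir -p .spin/ai-models/llama/codellama-instruct
$ tree .spin
```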

See the [serverless AI Tutorial](./ai-sentiment-analysis-api-tutorial) documentation for more concrete examples of implementing the Fermyon Serverless AI API in your favorite language.

> Embeddings models are slightly more complicated; it is expected that both a `tokenizer.json` and a `model.safetensors` are located in the directory named after the model. For example, for the `foo-bar-baz` model, Spin will look in the `.spin/ai-models/foo-bar-baz` directory for `tokenizer.json` and a `model.safetensors`.
> For embeddings models, it is expected that both a `tokenizer.json` and a `model.safetensors` are located in the directory named after the model. For example, for the `foo-bar-baz` model, Spin will look in the `.spin/ai-models/foo-bar-baz` directory for `tokenizer.json` and a `model.safetensors`.

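A minimal sketch of laying out such an embeddings model, assuming `tokenizer.json` and `model.safetensors` have already been downloaded (the `foo-bar-baz` model name is the placeholder from the note above):

```bash
# Hypothetical layout for an embeddings model: both files sit directly
# under .spin/ai-models/<model>, with no architecture directory in between.
$ mkdir -p .spin/ai-models/foo-bar-baz
$ cp /path/to/tokenizer.json /path/to/model.safetensors .spin/ai-models/foo-bar-baz/
```
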
## Serverless AI Interface

@@ -54,9 +58,9 @@ The set of operations is common across all supported language SDKs:

| Operation | Parameters | Returns | Behavior |
|:-----|:----------------|:-------|:----------------|
| `infer` | model`string`<br /> prompt`string`| `string` | The `infer` is performed on a specific model.<br /> <br />The name of the model is the first parameter provided (i.e. `llama2-chat`, `codellama-instruct`, or other; passed in as a `string`).<br /> <br />The second parameter is a prompt; passed in as a `string`.<br />|
| `infer_with_options` | model`string`<br /> prompt`string`<br /> params`list` | `string` | The `infer_with_options` is performed on a specific model.<br /> <br />The name of the model is the first parameter provided (i.e. `llama2-chat`, `codellama-instruct`, or other; passed in as a `string`).<br /><br /> The second parameter is a prompt; passed in as a `string`.<br /><br /> The third parameter is a mix of float and unsigned integers relating to inferencing parameters in this order: <br /><br />- `max-tokens` (unsigned 32 integer) Note: the backing implementation may return less tokens. <br /> Default is 100<br /><br /> - `repeat-penalty` (float 32) The amount the model should avoid repeating tokens. <br /> Default is 1.1<br /><br /> - `repeat-penalty-last-n-token-count` (unsigned 32 integer) The number of tokens the model should apply the repeat penalty to. <br /> Default is 64<br /><br /> - `temperature` (float 32) The randomness with which the next token is selected. <br /> Default is 0.8<br /><br /> - `top-k` (unsigned 32 integer) The number of possible next tokens the model will choose from. <br /> Default is 40<br /><br /> - `top-p` (float 32) The probability total of next tokens the model will choose from. <br /> Default is 0.9<br /><br /> The result from `infer_with_options` is a `string` |
| `generate-embeddings` | model`string`<br /> prompt`list<string>`| `string` | The `generate-embeddings` is performed on a specific model.<br /> <br />The name of the model is the first parameter provided (i.e. `all-minilm-l6-v2`, passed in as a `string`).<br /> <br />The second parameter is a prompt; passed in as a `list` of `string`s.<br /><br /> The result from `generate-embeddings` is a two-dimension array containing float32 type values only |
| `infer` | model`string`<br /> prompt`string` | `string` | The `infer` operation is performed on a specific model.<br /> <br />The name of the model is the first parameter provided (e.g. `llama2-chat`, `codellama-instruct`, or other; passed in as a `string`).<br /> <br />The second parameter is a prompt; passed in as a `string`.<br /> |
| `infer_with_options` | model`string`<br /> prompt`string`<br /> params`list` | `string` | The `infer_with_options` operation is performed on a specific model.<br /> <br />The name of the model is the first parameter provided (e.g. `llama2-chat`, `codellama-instruct`, or other; passed in as a `string`).<br /><br /> The second parameter is a prompt; passed in as a `string`.<br /><br /> The third parameter is a mix of floats and unsigned integers relating to inferencing parameters, in this order: <br /><br />- `max-tokens` (unsigned 32 integer) Note: the backing implementation may return fewer tokens. <br /> Default is 100<br /><br /> - `repeat-penalty` (float 32) The amount the model should avoid repeating tokens. <br /> Default is 1.1<br /><br /> - `repeat-penalty-last-n-token-count` (unsigned 32 integer) The number of tokens the model should apply the repeat penalty to. <br /> Default is 64<br /><br /> - `temperature` (float 32) The randomness with which the next token is selected. <br /> Default is 0.8<br /><br /> - `top-k` (unsigned 32 integer) The number of possible next tokens the model will choose from. <br /> Default is 40<br /><br /> - `top-p` (float 32) The probability total of next tokens the model will choose from. <br /> Default is 0.9<br /><br /> The result from `infer_with_options` is a `string` |
| `generate-embeddings` | model`string`<br /> prompt`list<string>` | `string` | The `generate-embeddings` operation is performed on a specific model.<br /> <br />The name of the model is the first parameter provided (e.g. `all-minilm-l6-v2`, passed in as a `string`).<br /> <br />The second parameter is a prompt; passed in as a `list` of `string`s.<br /><br /> The result from `generate-embeddings` is a two-dimensional array containing float32 type values only |

The exact detail of calling these operations from your application depends on your language:

