A large language model (LLM), in the context of artificial intelligence and natural language processing, is a machine learning model trained on vast amounts of text data to understand and generate human-like language. These models are designed to process and produce text that is contextually relevant and coherent.
The term "large" in this context usually indicates the size of the neural network architecture used to build the model. Larger models have more parameters, allowing them to capture more intricate patterns and nuances in language. These models are trained on diverse datasets, often containing a substantial portion of the internet's text, to learn grammar, vocabulary, facts, reasoning abilities, and even some degree of common sense.
One of the prominent examples of a large language model is OpenAI's GPT (Generative Pre-trained Transformer) series, such as GPT-3. These models have billions of parameters and exhibit impressive capabilities in tasks like language understanding, text completion, translation, summarization, and more. Large language models have found applications in various fields, including natural language processing, chatbots, content generation, and assisting with complex problem-solving tasks.
The AI Toolbox implementation is based on HuggingFace pipelines, which provide a common interface for LLMs. A pipeline takes some input text (commonly known as the prompt) and returns text generated by the LLM (which is why these models are called generative). A pipeline typically consists of five (or more) basic phases (a minimal usage sketch follows the list below):
- Tokenize the prompt
- Embed the tokens into a real-valued, high-dimensional space
- Run the embeddings through the language model
- Decode the output using a specified technique
- Look up the tokens for the embeddings returned by the model
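In practice these phases are wrapped by a single call. A minimal sketch using the transformers text-generation pipeline (the gpt2 model name is only an illustrative assumption, not necessarily the model used by the AI Toolbox):

```python
# Minimal text-generation pipeline sketch; "gpt2" is an illustrative model choice.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "A large language model is"
outputs = generator(prompt, max_new_tokens=30, num_return_sequences=1)

# Each output is a dict containing the prompt plus the generated continuation.
print(outputs[0]["generated_text"])
```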
Tokenization splits the input text into pieces, mostly words or smaller subword units. The tokenizer is typically tied to the selected model.
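A minimal tokenization sketch (again, gpt2 is only an illustrative choice; in practice the tokenizer is loaded to match the selected model):

```python
# Tokenization sketch; the tokenizer must match the model it is used with.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

text = "Tokenization splits text into subword pieces."
tokens = tokenizer.tokenize(text)  # subword strings, e.g. ['Token', 'ization', ...]
ids = tokenizer.encode(text)       # integer ids that are fed to the model

print(tokens)
print(ids)
```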
An LLM returns probabilities for the next token. However, selecting the next token from these probabilities is not straightforward. Common decoding strategies include:
- Greedy search
- Contrastive search
- Beam search
- etc.
Read about these strategies in more detail on the HuggingFace page; a short sketch of a few of them follows.
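The sketch below illustrates some of these strategies via model.generate(); the model name and parameter values are illustrative assumptions only:

```python
# Decoding strategy sketch; model name and parameter values are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The next token is chosen by", return_tensors="pt")

# Greedy search: always pick the single most probable next token.
greedy = model.generate(**inputs, max_new_tokens=20)

# Beam search: keep the num_beams most probable partial sequences.
beams = model.generate(**inputs, max_new_tokens=20, num_beams=4)

# Contrastive search: penalty_alpha and top_k trade off confidence and diversity.
contrastive = model.generate(**inputs, max_new_tokens=20, penalty_alpha=0.6, top_k=4)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```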
HuggingFace supports a diverse set of LLMs, see here.
Fine-tuning of LLMs for specific problems (e.g. context-aware question answering) can be divided into three groups:
- Transfer learning the whole network
- Transfer learning the attention network
- Context injection (also known as Retrieval Augmented Generation)
Fine-tuning the whole network is very resource-demanding, so it is generally not recommended. Fine-tuning only the attention network, however, can be performed easily, and the result is moderate in both size and resource demand. For fine-tuning e.g. the Falcon model, please see this article. Fine-tuning by transfer learning requires prompt-response pairs to train on.
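One common way to fine-tune only the attention weights is LoRA via the peft library. The sketch below is a rough outline, not the exact recipe of the referenced article; the base model and the target_modules names are assumptions that depend on the concrete architecture:

```python
# LoRA sketch with the peft library; model name and target_modules are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")  # assumed base model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # attention projection (architecture-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trainable
```

The adapted model is then trained on the prompt-response pairs with a regular training loop (e.g. the transformers Trainer).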
Context injection is much easier and more lightweight than transfer learning. It works by injecting context information into the prompt, so the LLM can be used as-is. An example of a context-injected prompt:
Answer the question below using the context provided!
CONTEXT: John Smith was born in 1956. Today is December 12, 2023.
QUESTION: How old is John Smith?
The context can be generated in a variety of ways, see e.g. the Context-Injection Tool.
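A minimal sketch of assembling such a context-injected prompt; the retrieve_context() helper is hypothetical and stands in for any retrieval step (e.g. the Context-Injection Tool or a vector search):

```python
# Context-injection sketch; retrieve_context() is a hypothetical placeholder.
def retrieve_context(question: str) -> str:
    # In practice: vector search, database lookup, the Context-Injection Tool, etc.
    return "John Smith was born in 1956. Today is December 12, 2023."

def build_prompt(question: str) -> str:
    context = retrieve_context(question)
    return (
        "Answer the question below using the context provided!\n"
        f"CONTEXT: {context}\n"
        f"QUESTION: {question}"
    )

print(build_prompt("How old is John Smith?"))
```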
By default, the tool contains a prepared Llama 3 7B model as a deployable service in the file query.ipynb. Llama 3 7B requires at least 16 GB of GPU memory! The inputs for the service:
- token: a predefined token for security reasons (string)
- system_prompt: the system prompt (string)
- user_prompt: the user prompt

The responses:

- resp: the response of the LLM model (string)
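A hypothetical call to the deployed service, assuming it is exposed as an HTTP endpoint that accepts JSON; the URL and request format are assumptions and must be adjusted to the actual deployment:

```python
# Hypothetical client call; the endpoint URL and payload format are assumptions.
import requests

payload = {
    "token": "my-secret-token",               # predefined token for security reasons
    "system_prompt": "You are a helpful assistant.",
    "user_prompt": "How old is John Smith?",
}

response = requests.post("http://localhost:8080/query", json=payload)  # assumed endpoint
print(response.json()["resp"])                # the LLM's response
```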