In the previous section we saw that a trained language model is less capable of representing a visual input than a trained vision model (both with similar transformer architectures). Given the nature of the inputs each model type is trained on, this may not seem very surprising. It is more informative to consider the ability of language model layer outputs to represent language inputs rather than images. More accurate representation implies that more input information passes through the model to the chosen layer's output.
To orient ourselves, first consider the architecture of a typical transformer-based language model.
Language input generation presents a unique challenge to gradient-based optimization methods because language inputs are fundamentally discrete: a word either exists in a certain part of a sentence or it does not. This is a notable contrast to images of the natural world, where each element (ie pixel) may take any value between 0 and the maximum possible intensity of that image (255 for 8-bit images) and is therefore much better approximated by continuous values.
The standard approach to observing the input information present in some chosen layer $O_l$ of a vision model is to start with a random normal input

$$a_0 = \mathcal{N}(a; \mu, \sigma)$$

and perform gradient descent on that input so as to minimize the distance between the layer's output for the generated input and its output for the target input $a$,

$$a_{n+1} = a_n - \eta \nabla_{a_n} \lVert O_l(a_n, \theta) - O_l(a, \theta) \rVert_1 \tag{2} \label{eq2}$$

where $\eta$ is the learning rate and $\theta$ denotes the model parameters.
This gradient descent method is not directly useful for language models without some modifications, given that their inputs are sequences of discrete tokens rather than continuous values.
There are two ways that we can perform gradient descent on tokenized inputs: the first is to convert the tokens to a lower-dimensional (and continuous) embedding and then perform gradient descent on that embedding, and the other is to convert the tokens into a continuous vector over the vocabulary (for example a one-hot encoding) and perform gradient descent on that vector directly. Here we take the first approach.
For large language models such as GPT-2, the conversion between a discrete token and a continuous embedding occurs via a word-token embedding, which is implemented as a lookup table (equivalently, a fully connected layer without biases) and is therefore equivalent to multiplying a one-hot input token vector by a (usually full column rank) embedding weight matrix $W_e$.
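As a minimal illustration of this equivalence (a sketch using a small random stand-in for the embedding weight matrix rather than GPT-2's own weights):

import torch

vocab_size, hidden_dim = 10, 4
W_e = torch.randn(vocab_size, hidden_dim)    # stand-in word-token embedding weights

token = torch.tensor(7)                      # a discrete token index
one_hot = torch.nn.functional.one_hot(token, num_classes=vocab_size).float()

# an embedding lookup and a one-hot matrix multiplication return the same row of W_e
print(torch.allclose(W_e[token], one_hot @ W_e))   # True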
As the embedding is continuous, gradient descent may be performed on it directly. But we then need a way to convert a generated embedding back into tokens, and the embedding weight matrix is not square (it maps a vocabulary-sized one-hot vector to a much smaller hidden dimension), so it has no exact inverse.
Instead we can use a generalized inverse, also known as the Moore-Penrose pseudo-inverse. The pseudo-inverse of a matrix $A$ is

$$A^+ = \lim_{\delta \to 0^+} (A^\top A + \delta I)^{-1} A^\top$$

which is the limit from above as $\delta$ approaches zero, where $I$ is the identity matrix. This limit exists even when $A^\top A$ is itself not invertible.
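For reference, the pseudo-inverse reduces to the ordinary inverse whenever $A$ is invertible, and in general it satisfies the defining Penrose identities

$$A A^+ A = A, \qquad A^+ A A^+ = A^+$$

so that multiplying an embedding by $W_e^+$ yields the minimum-norm least-squares solution for the (one-hot) token vector that produced it; taking the largest entry of that solution then recovers a discrete token.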
Therefore we can instead perform gradient descent on an initially random embedding using the update of Equation \eqref{eq2} (with the embedding in place of the input $a$), and then recover the generated input by applying the pseudo-inverse of the embedding weight matrix, where each token of the generated input is taken to be the index of the largest element of the corresponding row of $e W_e^+$.
Thus we can make use of the pseudo-inverse to convert an embedding back into an input token. To see how this can be done with an untrained implementation of GPT-2, the model and tokenizer may be obtained as follows:
import torch
from transformers import GPT2Config, GPT2LMHeadModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'

configuration = GPT2Config()
model = GPT2LMHeadModel(configuration)   # randomly initialized (untrained) GPT-2
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'gpt2')
Suppose we were given a prompt and a tokenizer to transform this prompt into a tensor corresponding to the tokens of each word.
prompt = 'The sky is blue.'
tokens = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors='pt',
    max_length=512,
    truncation=False,
    padding=False
).to(device)
For GPT-2 and other transformer-based language models, the total input embedding fed to the model is the sum of the word-token embedding and a positional embedding, and may be computed as follows:
model = model.to(device)
embedding = model.transformer.wte(tokens)   # word-token embedding, shape (1, n_tokens, 768)

position_ids = torch.arange(tokens.shape[1]).to(device)    # one position id per token
positional_embedding = model.transformer.wpe(position_ids)
embedding += positional_embedding
The positional weight matrix is invariant for any given input length and thus may be added to and subtracted from the input embedding, so we do not have to solve for this quantity. Therefore, given the embedding variable $e$, we can generate the input tokens by first subtracting the positional embedding $e_p$ and then applying the pseudo-inverse of the word-token embedding weight matrix,

$$a_g = \arg\max \left[ (e - e_p) W_e^+ \right] \tag{3} \label{eq3}$$

where the $\arg\max$ is taken over the vocabulary dimension at each token position, and this may be implemented as
embedding_weight = model.transformer.wte.weight            # shape (vocab_size, 768)
inverse_embedding = torch.linalg.pinv(embedding_weight)    # shape (768, vocab_size)
logits = torch.matmul(embedding - positional_embedding, inverse_embedding)
tokens = torch.argmax(logits, dim=2)[0]                    # most activated token at each position
It may be verified that Equation \eqref{eq3} is indeed capable of recovering the input tokens given an embedding by simply encoding any given sentence, converting that encoding to an embedding, and then inverting the embedding to recover the original encoding.
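A quick sketch of this round-trip check, using the model, tokenizer, and device defined above (the test sentence is arbitrary):

test_prompt = 'The sky is blue.'
test_tokens = tokenizer.encode(test_prompt, return_tensors='pt').to(device)

test_embedding = model.transformer.wte(test_tokens)                  # (1, n_tokens, 768)
inverse_embedding = torch.linalg.pinv(model.transformer.wte.weight)  # (768, vocab_size)

recovered = torch.argmax(test_embedding @ inverse_embedding, dim=2)
print(torch.equal(recovered, test_tokens))   # should print True: the encoding is recovered
print(tokenizer.decode(recovered[0]))        # should print the original sentence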
Before investigating the representations present in a large and non-invertible model such as GPT-2, we can first observe whether a small and invertible model is capable of accurate input representation (from its output layer). If the gradient descent procedure is sufficiently powerful, we would expect any input sentence to be generated exactly from pure noise.
The following model takes as input an embedding tensor with hidden_dim dimensions per token and input_length tokens, and is invertible. This MLP is similar to the transformer MLP, except without a residual connection or normalization and with at least as many output elements as input elements in each layer (which means that the model is invertible, assuming that the GeLU transformation does not zero out many inputs).
import torch.nn as nn

class FCNet(nn.Module):

    def __init__(self, input_length=5, hidden_dim=768):
        super().__init__()
        # every layer has at least as many outputs as inputs, so the model is invertible
        self.input = nn.Linear(input_length * hidden_dim, 4 * input_length * hidden_dim)
        self.h2h = nn.Linear(4 * input_length * hidden_dim, 4 * input_length * hidden_dim)
        self.input_length = input_length
        self.gelu = nn.GELU()

    def forward(self, x):
        x = x.flatten()
        x = self.input(x)
        x = self.gelu(x)
        x = self.h2h(x)
        return x
Given the target input and its embedding, and starting with Gaussian noise as the initial embedding, gradient descent on the distance between outputs (Equation \eqref{eq2}) recovers the target input exactly, meaning that our small fully connected model has been successfully inverted.
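A minimal sketch of this inversion loop follows, assuming the FCNet class and device defined above; the target tensor stands in for the embedding of the target input, and the learning rate and iteration count are purely illustrative:

fc_model = FCNet(input_length=5, hidden_dim=768).to(device)

target = torch.randn(1, 5, 768).to(device)       # stands in for the target input embedding
with torch.no_grad():
    target_output = fc_model(target)

generated = torch.randn(1, 5, 768).to(device)    # start from Gaussian noise
generated.requires_grad = True

for i in range(500):
    output = fc_model(generated)
    loss = torch.sum(torch.abs(output - target_output))   # L1 distance on the output
    loss.backward()
    with torch.no_grad():
        generated -= 0.001 * generated.grad                # gradient descent step
    generated.grad = None

print(torch.abs(generated - target).mean().item())   # mean absolute error of the recovered embedding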
It may be wondered whether a non-invertible model (containing one or more layer transformations that are non-invertible) would be capable of exactly representing the input. After all, the transformer MLP is non-invertible, as its output is four times smaller than its middle layer. If we change the last layer of our small MLP to have input_length * hidden_dim elements, we find that the generated inputs are no longer guaranteed to match the target exactly, although they typically remain close approximations of it. These representations yield the same next-token output for a trained GPT-2, indicating that they are considered to be nearly the same as the target input with respect to that model as well.
To get an idea of how effective our gradient procedure is in terms of metric distances, we can construct a shifted input $a'$ that differs from the target $a$ by a small change (for example, by swapping a single token). Feeding $a'$ to the model gives an output whose distance from the target's output serves as an estimate of what counts as 'close' to $a$. For the one-block GPT-2 we then require the generated input $a_g$ to satisfy

$$\lVert O(a_g, \theta) - O(a, \theta) \rVert_1 < \lVert O(a', \theta) - O(a, \theta) \rVert_1 \tag{4} \label{eq4}$$

such that the generated input's output is closer to the target's output than the shifted input's output is.
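A sketch of how this comparison may be made in code is below; here a_model stands for the truncated model defined in the next section, and target_embedding, shifted_embedding, and generated_embedding are assumed to hold the embeddings of the target, shifted, and generated inputs respectively:

with torch.no_grad():
    target_out = a_model(target_embedding)
    shifted_out = a_model(shifted_embedding)
    generated_out = a_model(generated_embedding)

benchmark_distance = torch.sum(torch.abs(shifted_out - target_out))    # right-hand side of \eqref{eq4}
generated_distance = torch.sum(torch.abs(generated_out - target_out))  # left-hand side of \eqref{eq4}
print(generated_distance < benchmark_distance)                         # True if \eqref{eq4} is satisfied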
To recap the last section, on this page we will observe the information present in each layer of a given language model by performing gradient descent in embedding space before converting the generated embedding back to tokens. The following figure summarizes how this works for a typical transformer-based model.
One of the primary challenges of large language models today is their tendency to generate text that is grammatically and stylistically consistent with the prompt but inaccurate in some other way, either introducing incorrect information about a topic or else veering off in an unintended direction.
It can be shown, however, that these models are capable of a much more extreme translation from input nonsense into some real language output by making use of the input representations we have generated in the previous section. Suppose one were given a short prompt: feeding this input into a trained GPT-2 yields a very reasonable completion.
But it can also be shown that one can find many completely nonsensical inputs that yield an identical output. We will see this first with an untrained version of GPT-2 that has been truncated to include a certain number (below, only one) of transformer blocks followed by the language modeling head. The language modeling head allows us to obtain the next predicted word for any input into this model, which provides one measure of 'closeness': whether our generated sentence has the same next predicted word as the target input.
class AbbreviatedGPT(nn.Module):

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x: torch.Tensor):
        # choose the block depth: here only the first transformer block is used
        for i in range(1):
            x = self.model.transformer.h[i](x)[0]

        x = self.model.lm_head(x)
        return x
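For example, the one-block model may be instantiated and applied to the embedding computed earlier as follows (a sketch; the output contains one logit per vocabulary entry for each input token):

a_model = AbbreviatedGPT(model).to(device)

with torch.no_grad():
    output = a_model(embedding)    # embedding computed above, shape (1, n_tokens, 768)

print(output.shape)                # e.g. torch.Size([1, 5, 50257]) for a 5-token prompt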
When we generate an input after a few hundred iterations of Equation \eqref{eq2}, passing the resulting embedding through Equation \eqref{eq3} to be inverted, the recovered text does not match the target input. If an (untrained) language modeling head is attached to this first transformer block, however, we find that the two inputs really are viewed essentially equivalently for the purposes of causal language modeling, in the sense that both yield the same next predicted token.
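A sketch of this next-token check, assuming target_embedding and generated_embedding hold the target and generated embeddings respectively:

with torch.no_grad():
    target_logits = a_model(target_embedding)
    generated_logits = a_model(generated_embedding)

# the next-token prediction is read from the logits of the last input position
target_next = torch.argmax(target_logits[0, -1, :]).item()
generated_next = torch.argmax(generated_logits[0, -1, :]).item()

print(tokenizer.decode([target_next]), tokenizer.decode([generated_next]))
print(target_next == generated_next)   # True when both inputs yield the same next token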
If we increase the maximum number of iterations of \eqref{eq2}, the generated prompt becomes closer to the target, and increasing the total iterations further sometimes yields an exact match, although most generated prompts are close but not quite equal to the original.
With an increase in the number of transformer blocks before the output modeling head, it becomes more difficult to recover the target input.
Using the full 12 transformer blocks of an untrained (base) GPT-2, followed by the language modeling head, gradient descent on the embedding can still effectively minimize the distance between the generated and target outputs, although the recovered input is no longer an exact match to the target.
So far we have only considered input representations from untrained models. It may be wondered what the training process does to the model's representational ability; to find out, we will use the same abbreviated model configuration as above (GPT-2 transformer blocks following the input and positional embeddings, ending in the language modeling head output).
When performing the input representation procedure detailed above on a trained GPT-2 (importantly with a language modeling head attached), the first thing to note is that the model appears to be very poorly conditioned, such that using gradient descent to modify an input to match some output requires careful tuning of the learning rate and a very large number of iterations, on the order of one hundred times as many as for the untrained model. This very slow minimization of the output loss also occurs when the gradient is calculated on a different output metric.
There is a trivial way to satisfy \eqref{eq4} for any
Given an input
and with
Note, however, that even values
Returning to the general case, it takes an extremely large number of iterations of \eqref{eq2} to approach the inequality \eqref{eq4}, and often it cannot be satisfied in a feasible number of iterations at all. This observation suggests that the trained GPT-2 model cannot accurately pass gradients from the output to the input. Why would the gradients arriving at the early layers of a model be inaccurate? It may be wondered whether this is due to rounding errors in the backpropagation of gradients. One way to check this is to convert both the model and the inputs in question to the torch.double() type, ie 64-bit rather than the default 32-bit floating point values. Unfortunately there is no significant change in the number of iterations required to make an input that satisfies \eqref{eq4}, and it remains infeasible to satisfy that inequality for the trained model.
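The conversion itself is straightforward (a sketch; the truncated model must be rebuilt from the converted model, and the input embedding must match the model's dtype):

model = model.double()                       # convert all parameters to 64-bit floats
a_model = AbbreviatedGPT(model).to(device)   # rebuild the truncated model from the converted one
embedding = embedding.detach().double()      # inputs must use the same dtype as the model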
The relative inability of gradient updates to the input embedding to minimize a loss function on the model output suggests that model layers adjacent to the input in the backpropagation computational graph (ie the first few transformer blocks) are also poorly optimized towards the end of training. Indeed, the poor optimization of the input embedding given only one trained transformer block suggests that most of the model is poorly updated towards the end of training, and that only a few output layers are capable of effective updates at this point.
Is there any way to use gradient descent to more effectively minimize some metric distance between a trained model's output for the generated input and its output for the target?
This result indicates that even when capable of minimizing an output distance metric, gradient descent on the embedding does not recover an accurate representation of the target input.
So far we have seen something somewhat unexpected: given some small input token array, the inputs recovered from a trained model's hidden layer output are gibberish even though they yield nearly the same output as the true sentence. Rather than decoding only the single most-activated token at each position, we can also examine the runner-up tokens. We can select the indices of the top 5 most activated tokens at each input position as follows:
tokens = torch.topk(logits, 5)[1][0]   # indices of the top-5 tokens at each position
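Each of the five candidate strings may then be read off by taking one column of these indices at every position and decoding it (a sketch using the tokenizer defined above):

# tokens has shape (n_tokens, 5): the five highest-activated token indices at each position
for k in range(5):
    candidate = tokens[:, k]           # the k-th ranked token at every position
    print(tokenizer.decode(candidate))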
Given the target input 'This is a prompt sentence.', we can compare the top-5 decodings for the models considered above. For the invertible fully connected network following a token-to-embedding transformation from a trained GPT-2 (as used above), one can see that successive outputs are semantically similar, which is perhaps what one would expect given that a trained embedding would be expected to recover words that mean approximately the same thing (ie 'speedy' is an approximate synonym for 'prompt'). The top five input token strings are
This is a prompt sentence.
this was an Prompt sentences.(
this are the promptssent,
This has another speedy paragraph.[
THIS isna prompted Sent."
but for a non-invertible fully connected model (with the last layer reduced to input_length * hidden_dim elements, as above) the top five decodings are
this is a prompt sentence.
thisModLoader an Prompt sentences.–
This corridikumanngthlems.<
THIS are another guiActiveUnDownloadha Mechdragon
Thisuberty theetsk recomm.(
The non-invertible model results above are not wholly surprising given that the model used was not trained, so equivalent inputs would not be expected to be very semantically similar. But a model composed of a single trained GPT-2 transformer block (no language modeling head) yields only gibberish as well:
PsyNetMessage▬▬ MarketablePsyNetMessagePsyNetMessagePsyNetMessage
srfAttachPsyNetMessagePsyNetMessagequerquequerqueartifacts
ocamp��極artifacts��極 unfocusedRange Marketable
irtualquerqueanwhileizontartifactsquerque
Accessorystaking-+-+ザigslistawaru
But when we compare the input representation of one transformer block of a trained model (above) to input representations for one transformer block of a randomly initialized and untrained model (below), we see something interesting: not only does training remove the ability of the first GPT-2 transformer decoder to accurately recover the input sentence from information in the output of that block, but the words corresponding to the input representation of a trained model are much less likely to exist in a real sentence than those decoded from the input representation of an untrained model. Specifically, observe that Marketable is the only word above that would ever be likely to be found in real text, whereas nearly every word below would likely be found given text of sufficient length.
This together a prompt sentence.
stock PER alotebra benevolentNBC
estab hydraulic declaration Dyn highlighted Caucas
しFormatorean DAM addressedball
ogeneity shows machine freed erasedcream
This observation suggests that training does indeed confer some ability to distinguish real sentences from other inputs: all of the nearly-sentence representations exhibited by the untrained model are no longer found near the target representation of the trained model, and only inputs that have almost zero probability of appearing in real text are observed.
For the investigation on this page, the main purpose of forming inputs based on the values of the output of some hidden layer is that the input generated gives us an understanding of what kind of information that hidden layer contains. Given a tensor corresponding to output activations of a layer, the information provided by that tensor is typically invariant to some changes in the input such that one can transform one possible input representation into another without changing the output values significantly. The collection of all such invariants may be thought of as defining the information contained in a layer.
Therefore one can investigate the information contained in some given layer by observing what can be changed in the input without resulting in a significant change in the output, starting with an input drawn from the data distribution. The process of moving along a low-dimensional manifold embedded in a higher-dimensional space is commonly referred to as a 'manifold walk', and we will explore methods that allow one to perform such a walk in the input space, where the lower-dimensional manifold is defined by the (lower-dimensional) hidden layer.
There are a number of approaches to finding which changes in the input are invariant for some model layer output, and of those using the gradient of the output as the source of information there are what we can call direct and indirect methods. Direct methods use the gradient of some transformation on the output applied to the input directly, whereas indirect methods transform the gradient appropriately and apply the transformed values to the input.
We will consider an indirect method first before proceeding to a direct one.
Given the gradient of the output with respect to the input, how would one change the input so as to avoid changing the output? Recall that the gradient of the layer output with respect to the input expresses the direction (in input space) along which the output increases most quickly, and it may be computed as follows:
def layer_gradient(model, input_tensor):
    ...
    input_tensor.requires_grad = True
    output = model(input_tensor)
    loss = torch.sum(output)            # scalar reduction of the layer output
    loss.backward()
    gradient = input_tensor.grad
    return gradient, loss.item()
If our purpose is instead to avoid changing the layer's output, we want what is essentially the opposite of the gradient: some direction in input space along which the output does not change at all, at least to first order. How can we find such a vector, ie a vector orthogonal to the gradient, which is typically a non-square tensor? For a single vector $g$, an orthogonal vector $x$ is any solution of $g \cdot x = 0$; we can find that trivially setting $x = 0$ satisfies this equation, but a zero step is useless, and what we really want is a nonzero vector that is simultaneously orthogonal to every row of the gradient tensor.
To do so, we can make use of the singular value decomposition of the input gradient matrix. Given a matrix $A$, the singular value decomposition is the factorization

$$A = U \Sigma V^\top$$

and may be thought of as an extension of the eigendecomposition of a matrix into orthonormal bases to matrices that are non-square. Here the columns of the matrix $U$ are the left-singular vectors, the columns of $V$ (equivalently the rows of $V^\top$) are the right-singular vectors, and $\Sigma$ is a diagonal matrix of singular values. The singular value decomposition has a number of applications in linear algebra, but for this page we only need to know that if $A$ has rank $r$, the right-singular vectors with index greater than $r$ span the null space of $A$: for any such vector $v$ we have $Av = 0$, so moving the input along $v$ does not change the output to first order.
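As a quick numerical check of this property (a sketch using a random stand-in for the input gradient, which for a 5-token input has shape (5, 768)):

import torch

grad = torch.randn(5, 768)           # stand-in for the input gradient of a 5-token embedding
U, S, Vh = torch.linalg.svd(grad)    # full SVD: Vh has shape (768, 768)

null_vector = Vh[-1]                 # any row of Vh with index >= 5 lies in the null space
print(grad @ null_vector)            # approximately the zero vector of length 5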
Now that a number of orthogonal vectors have been obtained, we can update the embedding by taking a small step along one of them,

$$e_{n+1} = e_n + \epsilon v$$

where $v$ is a right-singular vector in the null space of the input gradient and $\epsilon$ is a small step size.
For example, let's consider the input 'The sky is blue.' This is a 5-token input for GPT-2, where the embedding corresponding to this input is of dimension (1, 5, 768), ie batch, token, and hidden dimensions respectively, so the (squeezed) input gradient is a 5 x 768 matrix whose full SVD yields 768 right-singular vectors, all but at most 5 of which lie in the gradient's null space.
The orthogonal vector approach may therefore be implemented as follows:
def tangent_walk(embedding, steps):
    for i in range(steps):
        embedding = embedding.detach()
        gradient, _ = layer_gradient(a_model, embedding)   # layer_gradient() defined above
        gradient = torch.squeeze(gradient, dim=0)          # shape (input_length, hidden_dim)
        perp_vector = torch.linalg.svd(gradient).Vh[-1]    # any row index >= input_length gives a null-space vector
        embedding = embedding + 0.01*perp_vector           # small step along the orthogonal direction (broadcast over tokens)
    return embedding
where we can check that the SVD gives us sufficiently orthogonal vectors by multiplying perp_vector by gradient via

print (gradient @ perp_vector)   # check for orthogonality via matrix multiplication

If the value returned from this multiplication is sufficiently near zero, perp_vector is sufficiently orthogonal to the gradient and should not change the value of the layer's output, at least to a first-order approximation.
Note that gradients are defined for scalar-valued functions rather than for functions from vectors to other vectors, meaning that to obtain the full input gradient we must actually take the gradient of each element of the output separately and combine the results, which may be implemented as follows:
def individual_gradient(model, input_tensor):
    # accumulates the gradient of each output element separately (slow but explicit)
    input_tensor.requires_grad = True
    gradient = torch.zeros(input_tensor.shape).double().to(device)
    output = model(input_tensor).flatten()

    for i in range(len(output)):
        output = model(input_tensor).flatten()   # rebuild the graph freed by backward()
        loss = output[i]
        loss.backward()
        gradient += input_tensor.grad
        input_tensor.grad = None

    return gradient
But it is much more efficient to calculate the gradient of the sum of all output elements in a single backward pass, which may be implemented as
def summed_gradient(model, input_tensor):
    input_tensor.requires_grad = True
    output = model(input_tensor)
    sout = torch.sum(output)          # sum over all output elements
    sout.backward()
    gradient = input_tensor.grad.detach().clone()
    return gradient
It may be empirically verified that PyTorch evaluates these gradients nearly identically, as the difference between the input gradients calculated by these two methods is on the order of floating point rounding error.
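For example, the two may be compared on the small fully connected model (fc_model as instantiated in the earlier sketch), since looping over every output element of a full GPT-2 vocabulary would be prohibitively slow:

embedding_input = torch.randn(1, 5, 768).to(device)

grad_individual = individual_gradient(fc_model, embedding_input)   # one backward pass per output element (slow)
grad_summed = summed_gradient(fc_model, embedding_input)

# the two gradients agree up to floating point rounding error
print(torch.max(torch.abs(grad_individual - grad_summed)).item())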
Equivalently, the above method is computationally identical to
def layer_gradient(model, input_tensor):
    input_tensor.requires_grad = True
    output = model(input_tensor)
    output.backward(gradient=torch.ones_like(output).to(device))   # backprop a vector of ones instead of summing
    gradient = input_tensor.grad.detach().clone()
    return gradient
This completes the relevant details of the orthogonal walk. Unfortunately this method is not capable of finding new locations in the input space of GPT-2: the perp_vector found is usually not very accurately perpendicular to the gradient, such that multiplication of the orthogonal vector and the gradient returns values that are far from zero.
Therefore instead of applying the orthogonal walk approach directly to the GPT-2 model, we can first apply it to a model architecture that allows the SVD to produce accurate orthogonal vectors; for GPT-2 itself, the instability of the input gradient landscape means that finite learning rates give significant changes in the output, which we do not want. To gain an understanding of what the problem is, suppose one uses the model architecture mimicking the transformer MLP (with three layers, the input and output being the embedding dimension of 768 and the hidden layer 4*768). Multiplying the orthogonal vector obtained from the SVD of this model's input gradient by that gradient returns values such as

[ 3.7253e-09, -2.3283e-09, 1.9558e-08, -7.4506e-09, 3.3528e-08]

indicating that we have indeed found an approximately orthogonal vector to the gradient at every input position.
But when we compare the change in the input to the change in the output during this walk, even at torch.float64 precision, there is typically a larger change in the output than in the input, although the products of the orthogonal vector with the rows of the gradient remain near zero.
These results indicate that the orthogonal walk method is not capable of changing the input whilst leaving the output vector unchanged for all except the smallest of models. Nevertheless, for the untrained 3-layer MLP, after a few dozen iterations we obtain inputs that differ measurably from the original while the corresponding outputs remain essentially unchanged.
Although the orthogonal walk method is not particularly successful when applied to GPT-2, we can nevertheless observe the sequence of generated inputs it produces as the walk proceeds.
When torch.float64 precision is stipulated we find that a much more accurate orthogonal vector may be found, where the product of each row of the input gradient with the broadcast right-singular vector is very near zero.
In conclusion, one can walk along a manifold in the input embedding space on which a chosen layer's output is approximately constant, at least for small models.
Another approach to changing the input while fixing the output is to shift a portion of the input by adding some random amount to it, and then perform gradient descent on the rest of the input in order to minimize a metric distance between the modified and original outputs. The gradient we want may be found via
def target_gradient(model, input_tensor, target_output):
    ...
    input_tensor.requires_grad = True
    output = model(input_tensor)
    loss = torch.sum(torch.abs(output - target_output))   # L1 distance to the original output
    print (loss.item())
    loss.backward()
    gradient = input_tensor.grad
    return gradient
and the clamping procedure may be implemented by choosing a random index along the embedding dimension, shifting the embedding values of all tokens at or beyond that index by adding scaled random normally distributed values, and then performing gradient descent on the values below that index so as to minimize the change in the output, where rand_eta scales the added noise and lr is the gradient descent learning rate:
def clamped_walk(embedding, steps, rand_eta, lr):
    with torch.no_grad():
        target_output = a_model(embedding)   # output to be preserved
    embedding = embedding.detach()

    for i in range(steps):
        clamped = torch.randint(768, (1,)).item()   # random index along the embedding dimension
        shape = embedding[:, :, clamped:].shape
        embedding[:, :, clamped:] += rand_eta*torch.randn(shape).to(device)   # shift one portion with noise
        gradient = target_gradient(a_model, embedding, target_output)
        embedding = embedding.detach()
        embedding[:, :, :clamped] = embedding[:, :, :clamped] - lr*gradient[:, :, :clamped]   # restore the output with the rest

    return embedding
This technique is far more capable of accomplishing our goal of changing the input while leaving the output nearly unchanged: the distance between the generated and original inputs may be increased substantially while the output distance remains small. It is interesting to note that the transformer architecture is much more amenable to this gradient clamping optimization than the fully connected model, which generally does not yield an input that differs appreciably from the original without also changing the output.
Now that a method of changing the input such that the corresponding change to the output is minimized has been found, we can choose appropriate values of rand_eta and lr and observe which inputs lie near the target's representation.
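As a sketch of the full procedure (the values of rand_eta, lr, and steps here are purely illustrative, and embedding and positional_embedding are those computed for the target prompt above; the input is copied so the original embedding is not modified in place):

walked_embedding = clamped_walk(embedding.detach().clone(), steps=1000, rand_eta=0.01, lr=0.001)

# decode the walked embedding back into tokens via the pseudo-inverse, as before
inverse_embedding = torch.linalg.pinv(model.transformer.wte.weight)
logits = torch.matmul(walked_embedding - positional_embedding, inverse_embedding)
tokens = torch.argmax(logits, dim=2)[0]
print(tokenizer.decode(tokens))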
Once again, we find that the trained GPT-2 approximates an English sentence with gibberish. This is true even if we take the top-k tokens nearest to the generated embedding at each position; for k = 5 we have
elsius skyelsiuselsius.
advertisementelsiusadvertisementascriptadvertisement
eaturesascripteaturesadvertisementelsius
destroadvertisementhalla blue.]
thelessbiltascript":[{"bilt
It is interesting to note that practically every example of a poorly-formed input representation we have seen on this page suffers from some degree of repetition. Take the top-1 token input found above, with spaces added for clarity:

elsius sky elsius elsius .

where the token elsius appears three times in a five-token input.
This is interesting in light of the observation that language models (particularly smaller ones) often generate repeated phrases when instructed to give an output of substantial length. This problem is significant enough that efforts have been made to change the output decoding method: for example, Su and colleagues introduced contrastive search for decoding as opposed to simply decoding the model output as the token with the largest activation (which has been termed a 'greedy' decoding approach) during autoregression.
The tendency of language models to generate repetitive text during autoregression has been attributed by Welleck and colleagues to the method by which language models are usually trained, ie maximum likelihood on the next token in a string. The authors found that two measures ameliorate this repetition: modifying the objective function (via 'unlikelihood training') and modifying the decoding method to instead perform what is called a beam search. For all inputs, greedy decoding selects at each step the single token with the largest conditional probability,

$$x_t = \underset{x}{\arg\max} \; p(x \mid x_{<t})$$

Beam search instead attempts to maximize the total likelihood over a sequence of tokens rather than just one. For two tokens, this is

$$\underset{x_t, \, x_{t+1}}{\arg\max} \; p(x_t \mid x_{<t}) \, p(x_{t+1} \mid x_{<t+1})$$

where in practice the maximization is performed over a restricted set of candidate sequences (the beam) rather than over all possible token pairs.
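For reference, both decoding strategies are available through the HuggingFace generate API; the sketch below assumes model and tokenizer refer to a trained GPT-2 language modeling head model and its tokenizer as loaded elsewhere on this page, and the prompt and generation length are arbitrary:

input_ids = tokenizer.encode('The sky is', return_tensors='pt').to(device)

# greedy decoding: take the single highest-probability token at each step
greedy_output = model.generate(input_ids, max_new_tokens=20, do_sample=False)

# beam search: keep the 5 most likely partial sequences at each step
beam_output = model.generate(input_ids, max_new_tokens=20, num_beams=5, do_sample=False)

print(tokenizer.decode(greedy_output[0]))
print(tokenizer.decode(beam_output[0]))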
Returning to our model representation findings, it is clear that GPT-2 indeed tends to be unable to distinguish between repetitive sequences and true sentences. But viewed through the lens of representation theory, there is a clear reason why this would be: training has presumably never exposed the model to a sequence like the repetitive gibberish above, so the model has never been directly penalized for treating such a sequence as equivalent to a real sentence.
On the other hand, it is also clear that there is indeed some kind of penalty placed on this phrase because it should never appear during training. This is analogous to the idea of opportunity cost in economics, in which simply not choosing a profitable activity may be viewed as incurring some cost (the lost profit). Here a model that generates a nonsensical phrase is penalized in the sense that this output could not possibly give the model any benefit with regards to maximum likelihood training on the next token, whereas even a very unlikely but grammatically correct phrase could appear in some text and therefore could be rewarded.
Equivalently, observe that a model that has effectively minimized its negative log-likelihood objective function should not repeat words or phrases unless such repetitions were found in the training data, as errors such as these (on a per-token basis) would necessarily increase the negative log-likelihood. Therefore all language models of sufficiently high capacity, when trained on enough data, are expected to avoid repetition.
These observations suggest that if one were to train a language model of sufficiently large capacity on a sufficiently large dataset, repetition would cease to be found, as this model would more effectively minimize negative log-likelihood. From the performance of more recent and larger models than GPT-2 we can see that this may indeed be correct. Does this mean that the representations in larger models trained on more data will also be less prone to repetition?
The first part of this question may be answered by performing the same input representation procedure on larger versions of GPT-2, where the models have been trained on the same dataset (around 40 GB of text scraped mostly from outbound Reddit links). On this page we have thus far considered the base GPT-2 model with 117M parameters, and now we can observe the input representation repetition for gpt2-xl with 1.5B parameters. The appropriate trained models may be downloaded from the HuggingFace hub, which for gpt2-xl may be done as follows:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # all GPT-2 sizes share the same tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
we find that optimizing an input to match the output of the first transformer block is, if anything, easier than it was for the base gpt2 model with less than a tenth of the parameters: a mere 20 iterations of gradient descent is more than enough to satisfy the desired inequality \eqref{eq4}, in which the distance between the generated input's output and the target's output is smaller than that of the shifted input's output.
".[.>>.–.–.–
ADA.–annabin.>>.>>
kus".[ADAwanaiannopoulos
iott Gleaming.>>".[olor
Gleamingoloriottiannopouloswana
Interestingly, however, we can see that there is now much less repetition for each topk decoding. The same is true if we use the first 12 transformer decoders, which yield a top5 input representation as follows:
iets largeDownloadnatureconservancy".[twitter
udeauhexDonnellramids".[
largeDownload2200ledgePsyNetMessage>[
retteimanclipsTPPStreamerBot王
Citizenudeau2200brycekson
and for a 24-block hidden architecture the top5 decodings of the input representation are
Marketable (" Dragonboundju.>>
ossessionoptimisoliband().
minecraftsocketramid,...!.
iczisphere Canyonpiresaer
eloowntgur Pastebinnatureconservancy
and even for 40 blocks,
iczisphere Dragonboundheit\
umerableaddonszoneNT>.
auerohnramidohlSTEM
ozisonssoleolin.<
uskyankaFSibandrawl
The author's local GPU (an RTX 3060 with 12GB memory) runs out of memory when trying to perform backpropagation on the full 48-block gpt2-xl. In this case we are faced with a few options: either we can use larger GPUs or else we can compress the model or the model's parameters in some way. Model parameters are usually stored as 32-bit floating point values by default, but for this input representation visualization process we do not actually need full precision. Instead, we can load and convert the model to (mostly) 8-bit precision using Tim Dettmers' very useful bitsandbytes library (or the ported bitsandbytes-windows version if you are using Windows), which is integrated into the AutoModelForCausalLM module in the HuggingFace Transformers library such that a model may be loaded to a GPU in 8-bit precision using the following:
model = AutoModelForCausalLM.from_pretrained("gpt2-xl", load_in_8bit=True, device_map='auto')
bitsandbytes does not actually convert all model parameters to 8-bit precision, due to the ubiquitous presence of outlier features in large language models as shown by Dettmers and colleagues. Note that most CPUs do not support linear algebraic operations on datatypes smaller than 32 bits, so a GPU should be used here. Once the model has been loaded, the full 48-block stack of gpt2-xl gives a top-5 representation of the input 'The sky is blue.' as follows:
COURisphere Dragonboundheit\<
=>addonsGWolin>.
umerableankaOrgNTrawl
CVE '[soleiband.<
fter ('ramidohl$.
It is interesting to note that gradient-based optimization of the input is much more difficult for the full 48-block GPT-2 XL than for even the first 24-block subset of this model, indicating that the later transformer blocks are poorly conditioned relative to the earlier blocks. This is true even when 8-bit quantization is performed on the smaller subsets, indicating that quantization is not in this case responsible for the difficulties in optimizing via gradient descent.
To summarize this last section, simply increasing the model's size does seem to reduce the amount of repetition, but even the larger models we have investigated thus far do not exhibit accurate representations of language inputs. This investigation is continued in Part III.
In summary, transformer-based language models in the GPT-2 family are unable to distinguish between real English sentences and pure gibberish. Given a point in a transformer block hidden layer space corresponding to an input of a real sentence, we have found that most nearby points correspond to inputs that are not even approximately sentences but are instead completely unintelligible.
There exists a notable difference between trained language and vision transformer models: the latter contain modules that are at least partially capable of discerning what the input was composed of, whereas the former do not. But when we consider the training process for language models, it is perhaps unsurprising that input representations are relatively poor. Note that each of the gibberish input generations was almost certainly not found in the training dataset, precisely because each is very far from any real language. This means that the language model has no a priori reason to differentiate between these inputs and real text, and thus it is not altogether surprising that the model's internal representations would be unable to distinguish between the two.
In a sense, it is somewhat more unexpected that language models are so poorly capable of approximate (rather than exact) input representation. Consider the case of the single untrained transformer encoder detailed in a previous section: this module was unable to exactly represent the input (for the majority of random initializations), but the representations generated were semantically similar to the original prompt. This is what we saw for vision models, as input representations strongly resembled the target input even if they were not exact copies. The training process therefore results in a loss of representational accuracy in the transformer encoders.
In this page's introduction, we considered the implications of the 'no free lunch' theorems, which state (roughly) that no one particular model is better than any other at all possible machine learning tasks. Which tasks a given model performs well or poorly on depends on the model architecture and training protocol, and on this page we saw that trained language models perform quite poorly at the task of input generation because even early layers are unable to differentiate between English sentences and gibberish. Non-invertibility alone does not explain why these trained models are poor discriminators, but the training protocol (simply predicting the next word in a sentence) may do so.
When one compares the input representational ability of vision versus language transformer models, it is clear that language models retain much less information about their inputs. But when we consider the nature of language, this may not be altogether surprising: language places high probability on an extraordinarily small subset of possible inputs relative to natural images. For example, an image input's identity is invariant to changes in position, orientation (rotation), order, smoothness (ie a grainy image of a toaster is still a toaster), and brightness, whereas a language input is not invariant to any of these transformations. A much smaller subset of all possible inputs (all possible sequences of word or bytecode tokens) is therefore equivalent from the perspective of a language model, and the invariants to be learned may be more difficult to identify than for vision models.
Finally, it is also apparent that it is much more difficult to optimize an input via gradient descent on a loss of the model output after training than before. This appears to be a somewhat attention-intrinsic phenomenon, and it was also observed for Vision Transformers, but we find that language modeling heads (transformations from hidden space to output tokens) are particularly difficult to optimize across. This observation suggests that efficient training of transformer-based models is quite difficult, and could contribute to the notoriously long training times required for large language models. Freezing the language modeling head transformation would therefore be expected to assist the training process.