Google Reveals Why Large Models Can't Count R

Qubits
2024/09/04 15:20

| Official account QbitAI

Large models can handle Olympiad problems with ease, yet they repeatedly stumble on simple counting. The reason has now been found.

A new Google study found that large models fail at counting not simply because of the tokenizer, but because there is not enough space to store the vectors used for counting.

Counting the number of times a word appears in a passage is a simple task that can stump many large models, including GPT-4o and Claude 3.5.

Going a step further, finding the word that appears most frequently is even harder, and even when a model does give a specific count, that count is wrong.

Some people believe tokenization is to blame, because the "words" a large model sees differ from the words we perceive, but the paper shows that the reality is not so simple.

The counting ability of a Transformer depends on the relationship between its embedding dimension d and the vocabulary size m (the number of distinct words in the vocabulary, not the length of the sequence).

The detailed reason lies in the mechanism a Transformer uses to count word frequencies.

Through a special embedding scheme, the Transformer uses the linear structure of the embedding space to turn the counting problem into vector addition.

Specifically, each word is mapped to its own vector, and these vectors are mutually orthogonal. Under this representation, word frequencies can be computed simply by summing the vectors.

However, this mechanism has a limitation: every word in the vocabulary needs its own independent orthogonal vector, so the embedding dimension must be larger than the vocabulary size.

When the embedding dimension is insufficient, the word vectors can no longer be mutually orthogonal, and word frequencies can no longer be recovered by linear superposition.
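
The idea of counting by vector addition can be illustrated with a small sketch. This is not the paper's construction, only a toy under stated assumptions: the function names, the one-hot embeddings, and the "crowded" embeddings (three unit vectors spaced evenly on a circle, so d = 2 and m = 3) are all hypothetical choices made for illustration.

```python
import math

def one_hot_embeddings(m):
    """m mutually orthogonal unit vectors in dimension d = m (the standard basis)."""
    return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

def crowded_embeddings(m):
    """m unit vectors evenly spaced on a circle (d = 2); not orthogonal when m > 2."""
    return [[math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m)]
            for i in range(m)]

def count_by_vector_sum(sequence, query, embeddings):
    """Sum the embeddings of all tokens, then project the sum onto the query's embedding."""
    d = len(embeddings[0])
    total = [0.0] * d
    for tok in sequence:
        for k in range(d):
            total[k] += embeddings[tok][k]
    return sum(total[k] * embeddings[query][k] for k in range(d))

sequence = [0, 0, 0, 1, 1, 2]  # token 0 occurs 3 times

exact = count_by_vector_sum(sequence, 0, one_hot_embeddings(3))
crowded = count_by_vector_sum(sequence, 0, crowded_embeddings(3))

print(exact)    # 3.0: orthogonal vectors recover the true count
print(crowded)  # about 1.5: the other words leak into the projection
```

With orthogonal vectors the projection reads off the count exactly. With three vectors squeezed into two dimensions, every pair has inner product -0.5, so the other tokens contaminate the result.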

In that case, the Transformer can still count through the attention mechanism (CountAttend), but this requires a large "reverse MLP" layer whose size grows linearly with the sequence length.

Specifically, the model first uses attention to assign a large weight to the queried word, and then uses positional encoding to extract the attention weight into the last element of the value vector. This element actually records the reciprocal of the number of occurrences of the queried word.

This means that the model needs an MLP layer of size O(n) to calculate the 1/x function (x is the number of times a word occurs).
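
The attention-based route can be sketched in a few lines. This is a simplified illustration and not the paper's exact construction: the function name `count_attend`, the score scale `beta`, and the use of a 0/1 marker at the query position as the "value" are assumptions made here for clarity.

```python
import math

def count_attend(sequence, query_pos, beta=50.0):
    """One softmax-attention step that yields roughly 1 / count, then inverts it."""
    q = sequence[query_pos]
    # Scores: large (beta) for tokens equal to the queried word, 0 otherwise.
    scores = [beta if tok == q else 0.0 for tok in sequence]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # about 1/count on each matching position
    # Values: a positional marker that is 1 only at the query position.
    values = [1.0 if i == query_pos else 0.0 for i in range(len(sequence))]
    inv_count = sum(w * v for w, v in zip(weights, values))
    # The step a "reverse MLP" would have to learn: map 1/x back to x.
    return inv_count, round(1.0 / inv_count)

inv_count, count = count_attend([3, 1, 3, 2, 3, 1], query_pos=0)
print(inv_count)  # close to 1/3
print(count)      # 3
```

Attention itself only produces the reciprocal. The final division is done here with ordinary arithmetic, which is exactly the part the network would need a large MLP to emulate.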

However, further analysis shows that no ReLU network with a constant number of layers can approximate the 1/x function using only o(n) neurons.
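
To get a feel for why the size scales with n, here is a hedged sketch of a one-hidden-layer ReLU network that maps 1/k back to k for every k from 1 to n. It uses n - 1 hidden units, one kink per interval. It only illustrates that O(n) units are enough for this toy interpolation and says nothing about the lower bound itself. The function names are hypothetical.

```python
def build_relu_inverter(n):
    """Return (bias, units) for g(x) = bias + sum(c * relu(x - t)) with g(1/k) = k, k = 1..n."""
    ks = list(range(n, 0, -1))        # n, n-1, ..., 1
    knots = [1.0 / k for k in ks]     # inputs in ascending order: 1/n, ..., 1
    targets = [float(k) for k in ks]  # desired outputs at those inputs
    units = []
    prev_slope = 0.0
    for i in range(n - 1):
        # Slope of the straight segment between two neighbouring knots.
        slope = (targets[i + 1] - targets[i]) / (knots[i + 1] - knots[i])
        # Each hidden unit contributes the change in slope at its knot.
        units.append((slope - prev_slope, knots[i]))
        prev_slope = slope
    return targets[0], units

def relu_inverter(x, bias, units):
    """Evaluate the one-hidden-layer ReLU network at x."""
    return bias + sum(c * max(0.0, x - t) for c, t in units)

bias, units = build_relu_inverter(20)
print(len(units))  # 19 hidden units for n = 20
print([round(relu_inverter(1.0 / k, bias, units)) for k in (1, 5, 20)])  # [1, 5, 20]
```

Doubling n doubles the number of kinks needed, because the points 1/k crowd together near zero while their targets keep growing.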

Therefore, for a Transformer of fixed size, this scheme cannot generalize to sequences of arbitrary length. When the sequence length exceeds the lengths seen in the training set, the model's counting ability deteriorates dramatically.

To test this conclusion, the authors conducted two experiments.

The first experiment was conducted on a Transformer model trained from scratch with the following parameters:

A standard model consisting of two Transformer layers and four attention heads is used;

The embedding dimension d ranges from 8 to 128;

For each fixed d, the vocabulary size m varies from 5 to 150, and 20 different values are tested;

The model is trained from scratch using the Adam optimizer with a batch size of 16, a learning rate of 10^-4, and 100,000 training steps.

Training and evaluation data are generated by random sampling. First, n words are sampled uniformly from a vocabulary of size m to form a sequence of length n.

The sequence length was set to n = 10m, so that each word appears 10 times on average, and a total of 1600 samples were used for testing.
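
The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_example`, the choice of the last token as the queried word, and the label for the most-frequent task (the count of the most frequent word) are assumptions.

```python
import random
from collections import Counter

def make_example(m, per_word=10, rng=None):
    """Sample a sequence of length n = per_word * m uniformly from m words."""
    rng = rng or random.Random()
    n = per_word * m
    sequence = [rng.randrange(m) for _ in range(n)]
    counts = Counter(sequence)
    query = sequence[-1]                # assumption: the last token is the query
    qc_label = counts[query]            # how often the queried word occurs
    mfc_label = max(counts.values())    # assumption: count of the most frequent word
    return sequence, qc_label, mfc_label

rng = random.Random(0)
sequence, qc_label, mfc_label = make_example(m=5, rng=rng)
print(len(sequence))  # 50, i.e. n = 10 * m
```

Because n is tied to m, the expected count per word stays at 10 no matter how large the vocabulary becomes.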

The authors found that as the vocabulary size increased, the model's counting accuracy dropped in a stepwise manner, and the tipping point occurred precisely where the vocabulary size exceeded the embedding dimension.

To further quantify the model's counting ability, the authors defined a metric m_thr: the critical vocabulary size at which the model's counting accuracy drops to 80%.

Intuitively, m_thr reflects the maximum vocabulary that the model can "withstand" for a given embedding dimension, and the larger the m_thr, the stronger the model's counting power.
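
One way to read m_thr off an accuracy curve is sketched below. The exact rule used by the authors is not given here, so this assumes m_thr is the largest tested vocabulary size for which accuracy is still at least 80% for it and for all smaller tested sizes. The accuracy values are made up purely for illustration and are not results from the paper.

```python
def critical_vocab_size(accuracy_by_m, threshold=0.8):
    """Largest tested m whose accuracy, and that of every smaller tested m, is >= threshold."""
    m_thr = None
    for m in sorted(accuracy_by_m):
        if accuracy_by_m[m] < threshold:
            break  # accuracy has fallen below the threshold; stop scanning
        m_thr = m
    return m_thr

# Hypothetical accuracy curve for one fixed embedding dimension (illustrative only).
toy_curve = {5: 1.00, 20: 0.99, 35: 0.97, 50: 0.85, 65: 0.60, 80: 0.40}
print(critical_vocab_size(toy_curve))  # 50
```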

The results show that for both the counting task (QC) and the task of finding the most frequent word (MFC), m_thr increases approximately linearly with the embedding dimension d.

The second experiment was conducted on a pre-trained Gemini 1.5 model, in which the authors focused more on the effect of vocabulary on counting ability.

They devised a series of counting tasks, each using a vocabulary of a different size, while keeping the average number of times each word appears in the sequence fixed.

This means that the larger the vocabulary in the experimental group, the longer the sequence.

As a control, the authors also set up a "Binary Baseline" whose vocabulary is fixed at only two words, but whose sequence lengths match those of the main experimental group.

In this way, it is possible to determine whether the vocabulary size or the sequence length is what causes the model's counting errors.
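
The logic of this control can be sketched as a pair of length-matched sequence generators plus an error measure. This is an illustration of the experimental design only: the function names, the default of 10 occurrences per word, and the integer tokens are assumptions, and the actual prompts given to Gemini 1.5 are not reproduced here.

```python
import random

def main_task_sequence(m, per_word, rng):
    """Main group: m distinct words, sequence length per_word * m."""
    return [rng.randrange(m) for _ in range(per_word * m)]

def binary_baseline_sequence(m, per_word, rng):
    """Control group: only two words, but the same length as the main group."""
    return [rng.randrange(2) for _ in range(per_word * m)]

def mean_absolute_error(predicted, actual):
    """Average of |prediction - truth| over all test items."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

rng = random.Random(0)
main_seq = main_task_sequence(m=8, per_word=10, rng=rng)
base_seq = binary_baseline_sequence(m=8, per_word=10, rng=rng)
print(len(main_seq), len(base_seq))       # 80 80
print(mean_absolute_error([3, 5], [4, 5]))  # 0.5
```

Since both groups share the same length, any gap in error between them can be attributed to the vocabulary size rather than to the sequence length.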

Experimental results show that the mean absolute error of Gemini 1.5 on the counting task increases significantly as the vocabulary grows, while the error of the "Binary Baseline" is much lower.

This suggests that the increase in vocabulary, rather than the increase in sequence length, is the main reason for the decline in the counting ability of large models.

However, the authors also said that although the study delineated the upper and lower bounds of the counting capacity of large models to a certain extent, these boundaries were not tight enough, and there was still a certain gap from the ideal results.

At the same time, the authors did not explore whether increasing the number of Transformer layers would change this conclusion; new technical tools will need to be developed in the future to verify this further.
