Why not use one-hot encoding
We previously used one-hot encoding, but starting with the concise implementation we use nn.Embedding instead. The main difference between the two is that one-hot encoding spreads words too far apart: every pair of distinct one-hot vectors is orthogonal and equally distant, so the encoding says nothing about which words are related.
One-hot encoding represents each token as a sparse vector as long as the vocabulary, with a single 1 at the token's index. That is effective for character-level representations, since the vocabulary is small and individual characters have little relation to one another. Applied to word-level tokenization, however, it not only fails to capture relationships between words but also needs a longer vector as the vocabulary grows, which increases the number of parameters.
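To make the "too far apart" point concrete, here is a minimal sketch using PyTorch's `F.one_hot`. The vocabulary size of 5 is a made-up value chosen only for illustration; it is not taken from the original implementation.

```python
import math

import torch
import torch.nn.functional as F

vocab_size = 5  # hypothetical tiny vocabulary, for illustration only
token_ids = torch.arange(vocab_size)

# One row per token; the vector length equals the vocabulary size.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()

# Dot products between different tokens are all zero, so the encoding
# carries no notion of similarity between tokens.
gram = one_hot @ one_hot.T

# Pairwise Euclidean distances: 0 on the diagonal, sqrt(2) everywhere else.
dist = (one_hot[:, None, :] - one_hot[None, :, :]).norm(dim=-1)
off_diagonal = dist[~torch.eye(vocab_size, dtype=torch.bool)]

print(one_hot.shape)
print(gram)
print(off_diagonal)
```

Every token ends up exactly as far from every other token, and the vector length is tied to `vocab_size`, so a word-level vocabulary of tens of thousands of entries would give equally long input vectors.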
nn.Embedding instead maps each token to a vector of length n. It not only fixes the vector size but is also a learnable layer, which means that during training similar words should gradually move closer together while dissimilar words move further apart (at least, that is what we expect).
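As a rough sketch of what this looks like in code, the snippet below builds an `nn.Embedding` layer and looks up a small batch of token indices. The vocabulary size, the embedding length, and the token indices are all hypothetical values picked for the example.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

vocab_size, embed_dim = 10_000, 64  # hypothetical sizes; embed_dim plays the role of n
embedding = nn.Embedding(vocab_size, embed_dim)

# A made-up batch of token indices with shape (batch_size=2, num_steps=3).
token_ids = torch.tensor([[3, 17, 256], [9, 3, 4095]])

# Each index is replaced by its row of the weight matrix: shape (2, 3, 64).
vectors = embedding(token_ids)

# The lookup gives the same result as multiplying a one-hot vector by the
# weight matrix, without ever building the 10,000-dimensional vector.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

# The weight matrix is an ordinary trainable parameter.
num_params = sum(p.numel() for p in embedding.parameters())

print(vectors.shape)
print(torch.allclose(vectors, via_matmul))
print(embedding.weight.requires_grad, num_params)
```

The output length stays at `embed_dim` no matter how large the vocabulary is, and because the weight matrix receives gradients, the rows can move during training. Whether related words actually end up close together depends on the data and the training objective.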