attention
# Why not use one-hot encoding
We previously used one-hot encoding, but starting from the concise implementation we switched to nn.Embedding. The main difference between the two is that one-hot encoding spaces words too far apart: every word is equally distant from every other word.
Each word is represented as an isolated vector. This works reasonably well at the character level, since individual characters carry little relational information. For word-level tokenization, however, this approach not only fails to capture the relationships between words, but the vector length also grows with the vocabulary, which drives up the number of parameters.
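A minimal sketch of this scaling issue, assuming PyTorch and a hypothetical vocabulary size and token indices chosen only for illustration:

```python
import torch
import torch.nn.functional as F

vocab_size = 10000                        # hypothetical vocabulary size
tokens = torch.tensor([2, 5, 42])         # hypothetical token indices
one_hot = F.one_hot(tokens, num_classes=vocab_size).float()
print(one_hot.shape)                      # torch.Size([3, 10000])
# Each token costs one dimension per word in the vocabulary, and all
# vectors are mutually orthogonal, so no similarity between words is encoded.
```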
nn.Embedding instead maps each token to a vector of length n. This not only fixes the vector size, but the mapping is also a learnable layer, which means that during training similar words should gradually move closer together while dissimilar words drift apart (at least, that is the assumption).
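A minimal sketch of the embedding alternative, again assuming PyTorch; the vocabulary size, embedding dimension n, and token indices below are hypothetical:

```python
import torch
from torch import nn

vocab_size, n = 10000, 32                 # hypothetical sizes
embedding = nn.Embedding(vocab_size, n)   # learnable lookup table of shape (vocab_size, n)
tokens = torch.tensor([2, 5, 42])         # hypothetical token indices
vectors = embedding(tokens)
print(vectors.shape)                      # torch.Size([3, 32]), independent of vocab_size
# embedding.weight is trained with the rest of the model, so tokens that
# play similar roles can end up with similar vectors.
```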