PyTorch: Loading word vectors into a Field vocabulary vs. an Embedding layer

    # PyTorch code.

    # Create a field for text and build a vocabulary with 'glove.6B.100d'
    # pretrained embeddings.
    TEXT = data.Field(tokenize = 'spacy', include_lengths = True)

    TEXT.build_vocab(train_data, vectors='glove.6B.100d')

    # Build an RNN model with an Embedding layer.
    class RNN(nn.Module):
        def __init__(self, ...):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            ...

    # Initialize the embedding layer with the Glove embeddings from the
    # vocabulary. Why are two steps needed???
    model = RNN(...)
    pretrained_embeddings = TEXT.vocab.vectors
    model.embedding.weight.data.copy_(pretrained_embeddings)

1 answer

User
#1 · Posted 2024-04-26 11:59:12

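When torchtext builds the vocabulary, it aligns the token indices with the embeddings. If your vocabulary does not have the same size and ordering as the pretrained embeddings, the indices are not guaranteed to match, so you might look up the wrong embeddings. build_vocab() creates the vocabulary for your dataset together with the corresponding embeddings and discards the rest of the embeddings, because they are unused.

The GloVe 6B embeddings come with a vocabulary of 400K words. The IMDB dataset, for example, uses only about 120K of them; the other 280K are unused.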

    import torch
    from torchtext import data, datasets, vocab

    TEXT = data.Field(tokenize='spacy', include_lengths=True)
    LABEL = data.LabelField()

    train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
    TEXT.build_vocab(train_data, vectors='glove.6B.100d')

    TEXT.vocab.vectors.size()
    # => torch.Size([121417, 100])

    # For comparison the full GloVe
    glove = vocab.GloVe(name="6B", dim=100)
    glove.vectors.size()
    # => torch.Size([400000, 100])

    # Embedding of the first token is not the same
    torch.equal(TEXT.vocab.vectors[0], glove.vectors[0])
    # => False

    # Index of the word "the"
    TEXT.vocab.stoi["the"]
    # => 2
    glove.stoi["the"]
    # => 0

    # Same embedding when using the respective index of the same word
    torch.equal(TEXT.vocab.vectors[2], glove.vectors[0])
    # => True

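To make the alignment concrete, here is a toy sketch of the idea. It is not torchtext's actual implementation. The word lists, the 3-dimensional vectors and names such as `pretrained_itos` and `aligned` are made up for illustration, and the ordering (special tokens first) and the zero rows for tokens without a pretrained vector are assumptions of the sketch.

```python
import torch

# Hypothetical stand-in for a pretrained table such as GloVe:
# five words, 3-dimensional vectors (values are made up).
pretrained_itos = ["the", ",", "movie", "of", "great"]
pretrained_stoi = {w: i for i, w in enumerate(pretrained_itos)}
pretrained_vectors = torch.arange(15, dtype=torch.float32).reshape(5, 3)

# Hypothetical dataset vocabulary: special tokens first, then only the
# words that occur in the dataset, in the dataset's own order.
vocab_itos = ["<unk>", "<pad>", "great", "the", "movie"]
vocab_stoi = {w: i for i, w in enumerate(vocab_itos)}

# Build a matrix whose row i holds the vector of vocab_itos[i].
# Tokens without a pretrained vector keep a zero row (assumption).
aligned = torch.zeros(len(vocab_itos), pretrained_vectors.size(1))
for word, idx in vocab_stoi.items():
    if word in pretrained_stoi:
        aligned[idx] = pretrained_vectors[pretrained_stoi[word]]

idx_the = vocab_stoi["the"]  # index of "the" in the dataset vocabulary

# Correct: the aligned matrix returns the vector of "the".
same = torch.equal(aligned[idx_the], pretrained_vectors[pretrained_stoi["the"]])

# Incorrect: using the dataset index on the full table hits another word.
wrong_word = pretrained_itos[idx_the]

print(same, wrong_word)
```

In this toy setup the dataset index of "the" points at "of" in the full table, which is the kind of mismatch the aligned matrix avoids.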
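Once the vocabulary and its embeddings are built, the input sequences are given in tokenized form, where each token is represented by its index. In the model you want to use the embeddings of those tokens, so you need to create the embedding layer, but with the embeddings of your vocabulary. The easiest and recommended way is nn.Embedding.from_pretrained, which does essentially the same thing as the Keras version.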

    embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors)

    # Or if you want to make it trainable
    trainable_embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)

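As a rough check of how this relates to the two-step approach in the question, the sketch below compares both on a small random matrix. The tensor `vectors` is only a stand-in for `TEXT.vocab.vectors`, and the token indices are made up.

```python
import torch
import torch.nn as nn

# Stand-in for TEXT.vocab.vectors: 6 tokens, 4 dimensions (random values).
vectors = torch.randn(6, 4)

# One step: build the layer directly from the matrix (frozen by default).
one_step = nn.Embedding.from_pretrained(vectors)

# Two steps, as in the question: create a randomly initialised layer of
# the same shape, then overwrite its weights with the pretrained matrix.
two_step = nn.Embedding(num_embeddings=vectors.size(0),
                        embedding_dim=vectors.size(1))
with torch.no_grad():
    two_step.weight.copy_(vectors)

# A batch with one sequence, each token already replaced by its index.
token_ids = torch.tensor([[2, 5, 0, 3]])
out_one = one_step(token_ids)
out_two = two_step(token_ids)

print(out_one.shape)                  # (batch, sequence length, embedding dim)
print(torch.equal(out_one, out_two))  # both layers return the same vectors
print(one_step.weight.requires_grad, two_step.weight.requires_grad)
```

The lookups agree; the visible difference is that the from_pretrained layer is frozen unless freeze=False is passed, while the manually copied layer stays trainable.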
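You did not mention how embedding_matrix is created in the Keras version, nor how the vocabulary is built so that it can be used with embedding_matrix. If you do that by hand (or with any other utility), you do not need torchtext at all, and you can initialize the embeddings just as in Keras. torchtext is purely a convenience for common data-related tasks.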
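A minimal sketch of that manual route, assuming a GloVe-style text format (one token per line followed by its vector components). The in-memory `glove_text`, the `word_index` mapping and the zero rows for missing words are illustrative assumptions, not something taken from the question.

```python
import io
import torch
import torch.nn as nn

# Stand-in for a GloVe-style text file (values are made up, dimension 3).
glove_text = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "movie 0.4 0.5 0.6\n"
    "great 0.7 0.8 0.9\n"
)

# Parse each line into word -> vector.
word_to_vec = {}
for line in glove_text:
    parts = line.rstrip().split(" ")
    word_to_vec[parts[0]] = torch.tensor([float(x) for x in parts[1:]])

# Hypothetical vocabulary built by hand from the dataset.
word_index = {"<pad>": 0, "<unk>": 1, "great": 2, "movie": 3, "boring": 4}
embedding_dim = 3

# Row i of the matrix is the vector of the word with index i; words that
# are missing from the pretrained file keep a zero row in this sketch.
embedding_matrix = torch.zeros(len(word_index), embedding_dim)
for word, idx in word_index.items():
    if word in word_to_vec:
        embedding_matrix[idx] = word_to_vec[word]

# Initialise a trainable embedding layer from the hand-built matrix.
embedding_layer = nn.Embedding.from_pretrained(
    embedding_matrix, freeze=False, padding_idx=word_index["<pad>"]
)
```

With a real file, `glove_text` would be replaced by an open file handle; the rest of the sketch stays the same.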
