Get the probability of a multi-token word in the MASK position

from transformers import BertTokenizer, BertForMaskedLM
import torch

# init model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# init softmax to get probabilities later on
sm = torch.nn.Softmax(dim=0)
torch.set_grad_enabled(False)

# set sentence with MASK token, convert to token_ids
sentence = f"I {tokenizer.mask_token} you"
token_ids = tokenizer.encode(sentence, return_tensors='pt')
print(token_ids)
# tensor([[ 101, 1045,  103, 2017,  102]])

# get the position of the masked token
masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero().item()

# forward
output = model(token_ids)
last_hidden_state = output[0].squeeze(0)

# only get output for masked token
# output is the size of the vocabulary
mask_hidden_state = last_hidden_state[masked_position]

# convert to probabilities (softmax)
# giving a probability for each item in the vocabulary
probs = sm(mask_hidden_state)

# get probability of token 'hate'
hate_id = tokenizer.convert_tokens_to_ids('hate')
print('hate probability', probs[hate_id].item())
# hate probability 0.008057191967964172

# get probability of token 'love'
love_id = tokenizer.convert_tokens_to_ids('love')
print('love probability', probs[love_id].item())
# love probability 0.6704086065292358

# get probability of token 'reprimand' (?)
reprimand_id = tokenizer.convert_tokens_to_ids('reprimand')
# reprimand is not in the vocabulary, so it needs to be split into subword units
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# [UNK]

reprimand_id = tokenizer.encode('reprimand', add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# ['rep', '##rim', '##and']

# but how do we now get the probability of a multi-token word in a single-token position?

1 Answer

User
#1 · Posted 2024-05-14 23:55:55

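Because the split word does not appear in the vocabulary, BERT is simply unaware of it, so there is no point in masking it before tokenization.
Nor can you obtain its probability by applying the chain rule; see J. Devlin's response. To illustrate, take a more general example: try to estimate the probability of some bigram at position i. You can estimate the probability of each individual word given the rest of the sentence and its position: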
P(w_i | w_0, w_1, ..., w_{i-1}, w_{i+1}, ..., w_N)
P(w_{i+1} | w_0, w_1, ..., w_i, w_{i+2}, ..., w_N)
but there is no way to get the probability of the bigram itself,
P(w_i, w_{i+1} | w_0, w_1, ..., w_{i-1}, w_{i+2}, ..., w_N)
because BERT does not store that information.
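To make the bigram argument concrete, here is a small toy sketch. It does not involve BERT at all: the joint distribution and its numbers are invented for illustration, and `cond_first` / `cond_second` are hypothetical helper names. It only shows that the two single-word conditionals written above need not multiply out to the joint probability of the pair.

```python
# Hypothetical joint distribution P(w_i, w_{i+1} | context).
# The numbers are invented purely for illustration.
joint = {("new", "york"): 0.5, ("los", "angeles"): 0.5}


def cond_first(w1, w2):
    """P(w_i = w1 | w_{i+1} = w2, context), derived from the toy joint."""
    total = sum(p for (_, b), p in joint.items() if b == w2)
    return joint.get((w1, w2), 0.0) / total


def cond_second(w1, w2):
    """P(w_{i+1} = w2 | w_i = w1, context), derived from the toy joint."""
    total = sum(p for (a, _), p in joint.items() if a == w1)
    return joint.get((w1, w2), 0.0) / total


pair = ("new", "york")
product = cond_first(*pair) * cond_second(*pair)

print("joint probability of the pair:", joint[pair])
print("product of the two conditionals:", product)
```

In this toy case each word is certain once its neighbour is known, so the product of the conditionals is 1.0 while the pair itself only has probability 0.5.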
That said, you can get a very rough estimate by multiplying the probabilities of the word's parts. That gives you
P("reprimand"|...) ~= P("rep"|...)*P("##rim"|...)*P("##and"|...)
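Since your subwords are not regular words but a special kind of token, this is not entirely wrong, because the dependency between them is implicit.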
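The rough estimate above can be sketched as follows. This is only an illustration under stated assumptions: the sentence would contain one mask token per subword (for example three masks for 'rep', '##rim', '##and'), and the model's vocabulary-sized scores at each mask position are passed in as plain lists. The helper names `log_softmax` and `rough_word_log_prob` and the toy scores are hypothetical, not part of any library. Log-probabilities are summed rather than multiplied directly, which is the same product computed in a numerically safer way.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def rough_word_log_prob(logits_per_mask, subword_ids):
    """Sum of log P(subword_k | ...) over consecutive mask positions.

    logits_per_mask: one vocabulary-sized list of scores per mask position.
    subword_ids: vocabulary id of the subword expected at each position.
    """
    if len(logits_per_mask) != len(subword_ids):
        raise ValueError("need exactly one mask position per subword")
    return sum(
        log_softmax(logits)[token_id]
        for logits, token_id in zip(logits_per_mask, subword_ids)
    )


# Toy stand-in for model output: 3 mask positions over a 4-entry vocabulary.
toy_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
toy_ids = [0, 1, 2]  # stand-ins for the ids of 'rep', '##rim', '##and'

log_p = rough_word_log_prob(toy_logits, toy_ids)
print("rough estimate:", math.exp(log_p))
```

With the question's code, the ids would presumably come from `tokenizer.encode('reprimand', add_special_tokens=False)`, and each row of scores from `output[0].squeeze(0)` at a mask position, converted with `.tolist()`. The result remains a crude approximation, since each subword is scored independently of the others.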
