Replace words not in dict with <unk>

1 answer

User

#1 · Posted on 2024-04-26 14:02:18

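We split the problem into two parts:

1. Given a list of words passage, find the indices of the words that are not in another list of words, dictionary.
2. Then simply put <unk> at those indices.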

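The main work is in part 1. For it, we first convert the lists of strings to 2D numpy arrays so that the operations can be performed efficiently. We also sort the dictionary, which the binary search below requires, and pad the dictionary with zeros so that it has the same number of columns as passage_enc.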

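import numpy as np

# assume passage and dictionary are initially lists of words
passage = np.array(passage)  # fixed-width unicode array, e.g. dtype='<U4' when the longest word has 4 characters
# 2D uint8 array of shape len(passage) x max(len(x) for x in passage), holding the ords of the chars.
# Taking every 4th byte keeps the low byte of each UTF-32 code unit, so this assumes
# a little-endian layout and characters with code points below 256.
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]

dictionary = np.array(dictionary)
dictionary = np.sort(dictionary)  # sorted for the binary search
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
# zero-pad on the right; assumes the longest word in passage is at least as long as the longest word in dictionary
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)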
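To make the encoding concrete, here is a small sketch on toy data. The helper name encode and the sample words are made up for illustration and are not part of the answer. Instead of padding afterwards, it picks one shared width up front, which gives the same zero-padded layout and also covers the case where the dictionary contains the longest word. It still assumes characters with code points below 256.

```python
import numpy as np

def encode(words, width):
    """Return a (len(words), width) uint8 array of character codes, zero-padded on the right."""
    # Force little-endian UTF-32 so that byte 0 of every character is its low byte.
    arr = np.array(words, dtype=f"<U{width}")
    return arr.view(np.uint8).reshape(len(words), 4 * width)[:, ::4]

passage = ["the", "cat", "sat"]
dictionary = sorted(["cat", "a", "the"])           # sorted for the binary search
width = max(len(w) for w in passage + dictionary)  # shared number of columns

passage_enc = encode(passage, width)
dictionary_enc = encode(dictionary, width)
print(passage_enc)
print(dictionary_enc)
```

Each row is one word and each column one character position, so comparing two rows left to right is the same as comparing the two strings.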

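Then we simply iterate over passage and check whether each string (now an array row) is in the dictionary. Done naively, this takes O(n*m), where n and m are the sizes of the passage and the dictionary respectively. However, we can improve on this by sorting the dictionary beforehand and doing a binary search in it, which brings the cost down to O(n*log m).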
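The sort-then-binary-search idea can also be tried directly on the string arrays with np.searchsorted, without the byte encoding. This is a different route from the one the answer takes, shown only as a rough sketch; the function name unknown_indices_searchsorted is made up for illustration, and a non-empty dictionary is assumed.

```python
import numpy as np

def unknown_indices_searchsorted(passage, dictionary):
    """Indices of the words in passage that do not occur in dictionary."""
    passage = np.asarray(passage)
    sorted_dict = np.sort(np.asarray(dictionary))
    # One binary search per word: O(n * log m) comparisons overall.
    pos = np.searchsorted(sorted_dict, passage)
    # Words larger than every dictionary entry get position m; clamp to a valid index.
    pos = np.minimum(pos, len(sorted_dict) - 1)
    # A word is known only if the entry at its insertion position equals it.
    return np.nonzero(sorted_dict[pos] != passage)[0]

idx = unknown_indices_searchsorted(
    ["the", "dog", "sat", "on", "the", "mat"],
    ["the", "sat", "on"],
)
print(idx)
```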

To make the code run faster, we can also use JIT compilation. Below, I use numba.

[code block defining replace(dictionary_enc, passage_enc) missing from the source]
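A minimal sketch of what such a numba function might look like follows. It is not the author's original listing. It assumes that replace takes the two encoded uint8 arrays, with dictionary_enc sorted row-wise and of the same width as passage_enc, and returns the indices of the unknown words. The helper compare_rows and the fallback decorator for environments without numba are added for illustration.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: run as plain (slow) Python if numba is not installed.
    def njit(func):
        return func


@njit
def compare_rows(a, b):
    # Lexicographic comparison of two equal-length rows: returns -1, 0 or 1.
    for k in range(a.shape[0]):
        if a[k] < b[k]:
            return -1
        if a[k] > b[k]:
            return 1
    return 0


@njit
def replace(dictionary_enc, passage_enc):
    n = passage_enc.shape[0]
    unknown = np.zeros(n, dtype=np.bool_)
    for i in range(n):
        # Binary search for row i of the passage among the sorted dictionary rows.
        lo = 0
        hi = dictionary_enc.shape[0] - 1
        found = False
        while lo <= hi:
            mid = (lo + hi) // 2
            c = compare_rows(passage_enc[i], dictionary_enc[mid])
            if c == 0:
                found = True
                break
            elif c < 0:
                hi = mid - 1
            else:
                lo = mid + 1
        unknown[i] = not found
    return np.nonzero(unknown)[0]  # indices of words absent from the dictionary
```

The returned index array can be used directly for fancy indexing into passage.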

Checking on sample data

import nltk
import numpy as np

# requires the Gutenberg corpus, e.g. via nltk.download('gutenberg')
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
passage = np.array(emma)
passage = np.repeat(passage, 50)  # bloat the corpus to around 10 million words
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]

persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
dictionary = np.array(persuasion)
dictionary = np.sort(dictionary)  # sort for binary search

dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))

# pad with zeros so that dictionary_enc and passage_enc have the same shape[1]
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)

The sizes of passage and dictionary end up on the order the OP asked for, for timing purposes. This call:

unknown_indices = replace(dictionary_enc, passage_enc)

takes 17.028 s on my 8-core, 16 GB system (including preprocessing time, but obviously not the time to load the corpus).

After that, it is simply:

passage[unknown_indices] = "<unk>"
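One thing to watch for with this assignment: NumPy string arrays have a fixed width, so if the longest word in passage has fewer than five characters, "<unk>" is silently truncated. A small sketch of a guard, using made-up toy data:

```python
import numpy as np

passage = np.array(["the", "dog", "sat"])  # 3-character unicode dtype
unknown_indices = np.array([1])

# Each character takes 4 bytes; widen the dtype if "<unk>" would not fit.
if passage.dtype.itemsize < 4 * len("<unk>"):
    passage = passage.astype("U5")

passage[unknown_indices] = "<unk>"
print(passage)
```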

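Note: I suppose that using parallel=True in the njit decorator could give a further speedup. I am getting some strange errors with it and will edit the answer if I manage to resolve them.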
