Improving the performance of text cleaning on a DataFrame

import re

from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the matching WordNet POS constant.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

stops = set(stopwords.words('english'))
lem = WordNetLemmatizer()

def clean(text):
    lettersOnly = re.sub('[^a-zA-Z]', ' ', text)
    tokens = word_tokenize(lettersOnly.lower())
    tokens = [w for w in tokens if w not in stops]
    tokensPOS = pos_tag(tokens)
    tokensLemmatized = []
    for w in tokensPOS:
        tokensLemmatized.append(lem.lemmatize(w[0], get_wordnet_pos(w[1])))
    clean = " ".join(tokensLemmatized)
    return clean

         672542 function calls (672538 primitive calls) in 2.798 seconds

   Ordered by: internal time
   List reduced from 211 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4097    0.727    0.000    0.942    0.000 perceptron.py:48(predict)
     4500    0.584    0.000    0.584    0.000 {built-in method nt.stat}
     3500    0.243    0.000    0.243    0.000 {built-in method nt._isdir}
    14971    0.157    0.000    0.178    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
    57358    0.129    0.000    0.155    0.000 perceptron.py:250(add)
     4105    0.117    0.000    0.201    0.000 {built-in method builtins.max}
   184365    0.084    0.000    0.084    0.000 perceptron.py:58(<lambda>)
     4097    0.057    0.000    0.213    0.000 perceptron.py:245(_get_features)
      500    0.038    0.000    1.220    0.002 perceptron.py:143(tag)
     2000    0.034    0.000    0.068    0.000 ntpath.py:471(normpath)

1 answer

User

#1 · posted 2024-05-16 23:25:47

The first obvious improvement I see here is that the whole get_wordnet_pos function can be reduced to a dictionary lookup:

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

Instead, initialize a defaultdict from the collections package:

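from collections import defaultdict

get_wordnet_pos = defaultdict(lambda: wordnet.NOUN)
get_wordnet_pos.update({'J': wordnet.ADJ,
                        'V': wordnet.VERB,
                        'N': wordnet.NOUN,
                        'R': wordnet.ADV})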

You would then access the lookup like this:

get_wordnet_pos[w[1][0]]
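To check that a table lookup really is a drop-in replacement for the if/elif chain, here is a minimal sketch. It uses the plain strings "a", "v", "n" and "r" as stand-ins for wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV so that it runs without nltk; the names POS_BY_INITIAL, lookup_pos and get_wordnet_pos_branchy are made up for this example.

```python
from collections import defaultdict

# Stand-ins for nltk's wordnet.ADJ / VERB / NOUN / ADV constants
# (used here only so the sketch runs without nltk installed).
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"


def get_wordnet_pos_branchy(treebank_tag):
    """The original if/elif version, kept only for comparison."""
    if treebank_tag.startswith("J"):
        return ADJ
    elif treebank_tag.startswith("V"):
        return VERB
    elif treebank_tag.startswith("N"):
        return NOUN
    elif treebank_tag.startswith("R"):
        return ADV
    return NOUN


# Table keyed on the first character of the tag; any other key yields NOUN.
POS_BY_INITIAL = defaultdict(lambda: NOUN, {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV})


def lookup_pos(treebank_tag):
    # tag[:1] rather than tag[0], so an empty tag falls through to the default
    # instead of raising IndexError.
    return POS_BY_INITIAL[treebank_tag[:1]]


for tag in ["JJ", "VBD", "NNS", "RB", "DT", ""]:
    print(repr(tag), lookup_pos(tag), get_wordnet_pos_branchy(tag))
```

One thing to be aware of: a defaultdict stores every missing key it is asked for. With a fixed tag set that stays tiny, but a plain dict with .get(key, NOUN) would avoid it entirely.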

Next, if you use a regex pattern in several places, consider precompiling it. The speedup you get is not large, but every bit counts.

pattern = re.compile('[^a-zA-Z]')

Inside the function, you can then call:

pattern.sub(' ', text)
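Put together, that could look like the sketch below. The names NON_LETTERS and letters_only are made up for this example. The gain is small partly because the re module already caches recently compiled patterns, so what you save is mostly the per-call cache lookup.

```python
import re

# Compiled once at module level, so every call reuses the same pattern object.
NON_LETTERS = re.compile(r"[^a-zA-Z]")


def letters_only(text):
    # Replace every character that is not an ASCII letter with a space.
    return NON_LETTERS.sub(" ", text)


print(letters_only("Hello, World! 123").split())
```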

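Oh, and if you know where your text comes from and have some idea of what you may and may not see in it, you can build a table of characters up front and use str.translate instead, which is much faster than clunky regex-based substitution: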

tab = str.maketrans(dict.fromkeys("1234567890!@#$%^&*()_+-={}[]|\'\":;,<.>/?\\~`", '')) # pre-compiled use once substitution table (keep this outside the function)

text = 'hello., hi! lol, what\'s up'
new_text = text.translate(tab) # this would run inside your function

print(new_text)

hello hi lol whats up
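Note that the table above deletes characters, whereas the regex in the question replaces them with a space, so "what's" becomes "whats" rather than "what s". If you want to keep the regex behaviour, a variant that maps each character to a space might look like this. It is only a sketch: SPACE_TABLE and strip_symbols are made-up names, and it assumes string.punctuation plus string.digits covers everything you need to drop.

```python
import string

# Map every ASCII punctuation character and digit to a single space.
# Characters not listed here (e.g. accented letters) pass through untouched,
# unlike with the [^a-zA-Z] regex.
SPACE_TABLE = str.maketrans(dict.fromkeys(string.punctuation + string.digits, " "))


def strip_symbols(text):
    return text.translate(SPACE_TABLE)


print(strip_symbols("hello., hi! lol, what's up 42").split())
```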

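Also, I'd say word_tokenize is overkill here: you are stripping out all the special characters anyway, so you lose every benefit of word_tokenize, which really only makes a difference when punctuation and the like are present. You could fall back to text.split().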
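A rough sketch of what the split-based version could look like. STOPS here is a tiny hypothetical stop list standing in for stopwords.words('english'), and simple_tokens is a made-up name.

```python
import re

NON_LETTERS = re.compile(r"[^a-zA-Z]")

# Hypothetical, tiny stop list; in the real code this would be
# set(stopwords.words('english')).
STOPS = {"the", "a", "is", "of"}


def simple_tokens(text):
    # After the substitution only letters and spaces remain, so a plain
    # whitespace split is enough to separate the words.
    words = NON_LETTERS.sub(" ", text).lower().split()
    return [w for w in words if w not in STOPS]


print(simple_tokens("The cat's toy is RED!"))
```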

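Finally, skip the clean = " ".join(tokensLemmatized) step. Just return the list, and call df.applymap(" ".join) as the very last step. (In pandas 2.1 and later, applymap is deprecated in favour of DataFrame.map.)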

I'll leave the benchmarking to you.
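If you want a starting point for that, here is a minimal timeit harness for the character-stripping step only. It does not touch the POS tagger, which is the biggest entry in your profile. SAMPLE, SPACE_TABLE and bench are made-up names, and the numbers will vary by machine and input.

```python
import re
import string
import timeit

NON_LETTERS = re.compile(r"[^a-zA-Z]")
SPACE_TABLE = str.maketrans(dict.fromkeys(string.punctuation + string.digits, " "))

# Hypothetical sample input; substitute a few real rows from your DataFrame.
SAMPLE = "hello., hi! lol, what's up 42 " * 50


def bench(number=200):
    """Time each candidate `number` times and return total seconds per candidate."""
    candidates = {
        "re.sub (uncompiled)": lambda: re.sub("[^a-zA-Z]", " ", SAMPLE),
        "pattern.sub": lambda: NON_LETTERS.sub(" ", SAMPLE),
        "str.translate": lambda: SAMPLE.translate(SPACE_TABLE),
    }
    return {name: timeit.timeit(fn, number=number) for name, fn in candidates.items()}


for name, seconds in bench().items():
    print(f"{name:22s} {seconds:.4f}s")
```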
