How can I exclude all lowercase (a-z) combinations with RegexpTokenizer?

Published 2024-05-15 19:05:50


I need to remove all purely lowercase-letter combinations using the pattern option of RegexpTokenizer. Is there any way to do this?

What I have tried:

import re
from nltk import regexp_tokenize

# Multi-word phrases to protect: map each phrase to its underscored form.
data = {'fresh air', 'entertainment system', 'ice cream', 'milk', 'dog',
        'blood pressure', 'body temperature', 'car', 'ac', 'auto', 'air quality'}
data = {i: i.replace(" ", "_") for i in data}

# Match any protected phrase as a whole word.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, data)) + r")\b")

text_file = [
    'A is\'s vitamin-d in===(milk) "enough, carrying 321 active automatic body hi+al.',
    '{body temperature} [try] to=== improve air"s quality level by automatic intake of fresh air.',
    'turn on the tv or entertainment system based on that individual preferences',
    'blood pressure monitor',
    'I buy more ice cream',
    'proper method to add frozen wild blueberries in ice cream',
]

# Rewrite each sentence, injecting underscores into the protected phrases.
result = [pattern.sub(lambda x: data[x.group()], i) for i in text_file]

# With gaps=True the pattern matches the separators (whitespace, digits,
# punctuation), so everything between separators becomes a token.
tokens = [regexp_tokenize(sent, pattern=r"\s|[0-9!()\-+\$%;,.:@'\"/={}\[\]\']", gaps=True)
          for sent in result]
print(tokens)

Note: I need the output to stay in its current form; I only need the purely lowercase tokens excluded. Thanks in advance.

Adding [^a-z] does not work for me at all: it drops the injected underscore characters from some words, and losing those is not acceptable.
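One possible workaround, since a gaps-style tokenizer pattern cannot easily drop pure-lowercase tokens while keeping the underscored ones intact, is to tokenize first and filter afterwards. This is a minimal sketch; the token list here is a made-up example standing in for one sentence of the tokenizer's output, not the actual result of the code above:

    import re

    # Hypothetical tokens from one sentence, after underscore injection.
    tokens = ["body_temperature", "improve", "air", "quality",
              "fresh_air", "A", "321"]

    # Drop tokens that are nothing but lowercase a-z. Tokens containing
    # "_", uppercase letters, or digits do not fully match and survive,
    # so the injected underscores are preserved.
    filtered = [t for t in tokens if not re.fullmatch(r"[a-z]+", t)]
    print(filtered)  # → ['body_temperature', 'fresh_air', 'A', '321']

The same comprehension can be applied per sentence over the nested token list that the question's code produces.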

