How can I exclude all lowercase (a-z) combinations with RegexpTokenizer?

Published 2024-05-15 19:05:50


I need to remove all purely lowercase-letter combinations using the pattern option of RegexpTokenizer. Is there any way to do this?

What I have tried:

import re
from nltk import regexp_tokenize

# Multi-word phrases to protect: map each phrase to its underscored form.
data = {'fresh air', 'entertainment system', 'ice cream', 'milk', 'dog',
        'blood pressure', 'body temperature', 'car', 'ac', 'auto', 'air quality'}
data = {i: i.replace(" ", "_") for i in data}

# Match any protected phrase as a whole word.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, data)) + r")\b")

text_file = [
    'A is\'s vitamin-d in===(milk) "enough, carrying 321 active automatic body hi+al.',
    '{body temperature} [try] to=== improve air"s quality level by automatic intake of fresh air.',
    'turn on the tv or entertainment system based on that individual preferences',
    'blood pressure monitor',
    'I buy more ice cream',
    'proper method to add frozen wild blueberries in ice cream',
]

# Rewrite each sentence, injecting underscores into the protected phrases.
result = [pattern.sub(lambda x: data[x.group()], i) for i in text_file]

# With gaps=True the pattern matches the separators (whitespace, digits,
# punctuation), so everything between separators becomes a token.
tokens = [regexp_tokenize(sent, pattern=r"\s|[0-9!()\-+\$%;,.:@'\"/={}\[\]\']", gaps=True)
          for sent in result]
print(tokens)

Note: I need the output to stay in its current form; I only need the purely lowercase tokens excluded. Thanks in advance.

Adding [^a-z] does not work for me at all: it drops the injected underscore characters from some words, and losing those is not acceptable.
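One possible workaround, since a gaps-style tokenizer pattern cannot easily drop pure-lowercase tokens while keeping the underscored ones intact, is to tokenize first and filter afterwards. This is a minimal sketch; the token list here is a made-up example standing in for one sentence of the tokenizer's output, not the actual result of the code above:

    import re

    # Hypothetical tokens from one sentence, after underscore injection.
    tokens = ["body_temperature", "improve", "air", "quality",
              "fresh_air", "A", "321"]

    # Drop tokens that are nothing but lowercase a-z. Tokens containing
    # "_", uppercase letters, or digits do not fully match and survive,
    # so the injected underscores are preserved.
    filtered = [t for t in tokens if not re.fullmatch(r"[a-z]+", t)]
    print(filtered)  # → ['body_temperature', 'fresh_air', 'A', '321']

The same comprehension can be applied per sentence over the nested token list that the question's code produces.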

