>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']
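The filtered list can be turned back into a single string with join(); a minimal sketch continuing the snippet above:

```python
sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
# Keep only tokens made entirely of letters, then rejoin them with spaces
alpha_words = [word for word in sentence.split() if word.isalpha()]
clean = ' '.join(alpha_words)
print(clean)  # taken reports to the boss
```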
import nltk
text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
result = nltk.RegexpTokenizer(r'\w+').tokenize(text)
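For readers without nltk installed, RegexpTokenizer(r'\w+') behaves like re.findall(r'\w+', ...) from the standard library, so its output can be sketched with stdlib only. Note that \w+ also keeps digits such as 2, so it is not strictly letters-only:

```python
import re

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
# \w+ grabs runs of word characters; punctuation splits the text,
# so "didn't" becomes 'didn', 't' and the digit '2' is kept
tokens = re.findall(r'\w+', text)
print(tokens)
```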
import re
import nltk
nltk.download('punkt') # You only have to do this once
def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))
text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]
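The contains_letters filter itself needs only the standard library; a quick check of what it keeps and drops:

```python
import re

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

# Contraction pieces such as "n't" are kept, while pure numbers
# and punctuation-only tokens drop out
print(contains_letters("n't"))  # True
print(contains_letters("2"))    # False
print(contains_letters("..."))  # False
```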
import re
s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()
tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
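The character class [^\W\d] means "a word character that is not a digit", i.e. letters plus underscore; unlike [a-zA-Z] it also accepts non-ASCII letters. A small sketch with some made-up tokens to show what survives the filter:

```python
import re

# [^\W\d]*$ keeps tokens made only of letters/underscores
# (and, because Python's re is Unicode-aware, accented letters like 'é')
tokens = ["taken", "café", "said:", "2", "_tag", "I've"]
kept = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
print(kept)  # ['taken', 'café', '_tag']
```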
You can use split() and isalpha() to get a list of words that contain only alphabetic characters (and have at least one character), and then use join() to turn that list back into a single string, as the first snippet above does.

The nltk package specializes in working with text and provides various functions for "tokenizing" text into words. You can use RegexpTokenizer, or word_tokenize with a small modification. The simplest approach is RegexpTokenizer, shown in the second snippet, which returns:

['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']

Alternatively, you can use the slightly smarter word_tokenize, which splits most contractions, turning didn't into did and n't; the third snippet then filters its output down to tokens that contain at least one letter.

To match only letters (and underscores) with a regular expression, you can do what the last snippet shows.