How to remove every word that contains non-alphabetic characters

User

Reply 1 · edited 2024-05-16 02:47:21

You can use split() and isalpha() to get a list of the words that consist only of alphabetic characters and contain at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

You can then use join() to turn the list back into a single string:

>>> ' '.join(alpha_words)
'taken reports to the boss'
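One thing to watch for with split() plus isalpha(): a word that is directly followed by punctuation, such as "said:" or "boss.", fails isalpha() and is dropped. If you would rather keep those words, a minimal sketch is to strip leading and trailing punctuation before testing. The helper name keep_alpha_words is hypothetical and not part of the answer above.

```python
import string

def keep_alpha_words(sentence):
    """Return the words that are purely alphabetic once leading and
    trailing punctuation has been stripped (hypothetical helper)."""
    kept = []
    for word in sentence.split():
        core = word.strip(string.punctuation)  # "boss." -> "boss", "said:" -> "said"
        if core.isalpha():                     # "I've" and "asdf@gmail.com" still fail
            kept.append(core)
    return kept

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss."
print(keep_alpha_words(sentence))
# ['said', 'taken', 'reports', 'to', 'the', 'boss']
```

Whether "said" should count as a clean word here depends on what you want; the plain isalpha() filter is stricter and drops it.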

User

Reply 2 · edited 2024-05-16 02:47:21

The nltk package is built for working with text and has various functions for "tokenizing" text into words.

You can use either RegexpTokenizer or a slightly modified word_tokenize.

The simplest approach is RegexpTokenizer:

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

This returns:

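['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']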
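Note that \w matches digits and underscores as well as letters, so a pattern of \w+ keeps a token like 2 and splits "I've" at the apostrophe. If the goal is letters only, a narrower character class may be closer to what you want. The sketch below uses the standard re module instead of nltk, and assumes ASCII letters are enough for your text.

```python
import re

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

# [A-Za-z]+ matches runs of ASCII letters only, so the digit "2" is skipped.
letter_tokens = re.findall(r"[A-Za-z]+", text)
print(letter_tokens)
```

This still breaks the e-mail address and the contractions into pieces (asdf, gmail, com, didn, t), so it removes non-letter characters rather than whole words.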

Alternatively, you can use the slightly smarter word_tokenize, which can split most contractions, for example didn't into did and n't.

import re
import nltk
nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

This returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
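The contains_letters filter keeps any token with at least one letter, which is why gmail.com and n't survive. If you only want tokens made up entirely of letters, str.isalpha() is the stricter test. The sketch below compares the two on a hand-written sample token list (an assumption for illustration, not generated by nltk here), so it runs without any download.

```python
import re

def contains_letters(phrase):
    # True if the token has at least one ASCII letter anywhere in it.
    return bool(re.search(r"[a-zA-Z]", phrase))

# Hand-written sample tokens, chosen to resemble tokenizer output.
tokens = ["gmail.com", "I", "'ve", "2", "did", "n't", ".", "things"]

loose = [t for t in tokens if contains_letters(t)]  # at least one letter
strict = [t for t in tokens if t.isalpha()]         # letters only

print(loose)   # ['gmail.com', 'I', "'ve", 'did', "n't", 'things']
print(strict)  # ['I', 'did', 'things']
```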

User

Reply 3 · edited 2024-05-16 02:47:21

To keep only tokens made up of letters (and underscores) using a regular expression, you can do the following:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
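If underscores should not count as letters, one possible variation is to add _ to the negated class and use re.fullmatch so no explicit $ anchor is needed. This is a sketch rather than the method from the answer above; the names LETTERS_ONLY and letters_only_words are hypothetical.

```python
import re

# [^\W\d_] means: a word character that is neither a digit nor an underscore,
# i.e. a letter. With str patterns in Python 3 this includes non-ASCII letters.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")

def letters_only_words(s):
    """Keep whitespace-separated tokens that consist entirely of letters."""
    return [t for t in s.split() if LETTERS_ONLY.fullmatch(t)]

s = "asdf@gmail.com said: I've taken 2 reports to the boss snake_case café"
print(letters_only_words(s))
# ['taken', 'reports', 'to', 'the', 'boss', 'café']
```

Here snake_case is rejected because of the underscore, while café is kept because é is a Unicode letter.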
