How to remove every word that contains non-alphabetic characters

Published 2024-05-16 02:47:21


To test Zipf's law, I need to write a Python script that removes every word containing non-alphabetic characters from a text file. For example:

asdf@gmail.com said: I've taken 2 reports to the boss

should become:

taken reports to the boss

How can I do this?


3 Answers

You can use split() and isalpha() to get a list of the words that consist only of alphabetic characters and contain at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

You can then use join() to turn the list back into a single string:

>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
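Since the question is about a text file rather than a single sentence, the split/isalpha filter and the join step can be wrapped in a small helper and applied line by line. This is only a minimal sketch: the names keep_alpha_words and clean_lines are hypothetical, and the sample list stands in for an open file object.

```python
def keep_alpha_words(text):
    """Return text with every word that contains a non-letter removed."""
    return " ".join(word for word in text.split() if word.isalpha())


def clean_lines(lines):
    """Filter an iterable of lines (for example an open file), one result per line."""
    return [keep_alpha_words(line) for line in lines]


# Stand-in for `open("text.txt")`; any iterable of strings works.
sample = [
    "asdf@gmail.com said: I've taken 2 reports to the boss\n",
    "The boss read both reports\n",
]
cleaned = clean_lines(sample)
print(cleaned)
```

Note that str.isalpha() is Unicode-aware, so accented words such as "café" are kept. If only ASCII letters should count, adding `and word.isascii()` (Python 3.7+) to the condition is one option.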

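The nltk package specializes in text processing and has various functions for "tokenizing" text into words.

You can use RegexpTokenizer, or a slightly adapted word_tokenize.

The simplest approach is RegexpTokenizer: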

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

This returns:

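['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']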
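For comparison, the same \w+ pattern can be tried with the standard library alone. The sketch below assumes that re.findall is an adequate stand-in for the tokenizer on this input. It also shows a caveat: \w matches digits and underscores, and the pattern splits "asdf@gmail.com" into pieces instead of discarding it, so this approach does not quite do what the question asks.

```python
import re

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

# Every maximal run of word characters (letters, digits, underscore) becomes a token.
tokens = re.findall(r"\w+", text)

# Drop tokens that are not purely alphabetic, such as the digit '2'.
alpha_tokens = [t for t in tokens if t.isalpha()]
print(alpha_tokens)
```

Fragments such as 'gmail', 've' and 'didn' survive the filter, which would distort word counts for a Zipf's law test. Splitting on whitespace first and then testing each whole word avoids that.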

Alternatively, you can use the somewhat smarter word_tokenize, which can split most contractions, such as didn't, into did and n't.

import re
import nltk
nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

This returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']

Using a regular expression that matches only letters (and underscores), you can do the following:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
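If underscores should be rejected as well, one possible variation is the character class [^\W\d_] together with re.fullmatch. The sketch below also feeds the surviving words into collections.Counter to get the rank and frequency pairs a Zipf's law test needs. The names LETTERS_ONLY and letter_words, the sample sentence, and the lowercasing step are assumptions made for illustration, not part of the answer above.

```python
import re
from collections import Counter

# [^\W\d_] = a word character that is neither a digit nor an underscore, i.e. a letter.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def letter_words(text):
    """Return the lowercased whitespace-separated words made of letters only."""
    return [t.lower() for t in text.split() if LETTERS_ONLY.fullmatch(t)]


# Hypothetical sample text; 'said_so:' and '2' are filtered out.
sample = "the boss said_so: the reports reached the boss in 2 days"
counts = Counter(letter_words(sample))

# most_common() sorts by descending frequency, which gives the rank directly.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
```

Lowercasing merges "The" and "the" into one entry, which is usually what a frequency ranking wants. Drop the .lower() call if case should be preserved.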
