How do I use rstrip to remove trailing characters?

2 votes

2 answers

4321 views

Asked 2025-04-16 05:30

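I'm trying to loop over a set of documents and put the words of each document into a list. Here is how I'm doing it; stoplist is a list of words I want to ignore by default.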

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

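This gives me a list of documents, each of which is a list of words. However, some of the words still carry trailing punctuation or other stray characters. I thought I could handle it like this, but it doesn't seem to work well.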

texts = [[word.rstrip() for word in document.lower().split() if word not in stoplist]
         for document in documents]

or

texts = [[word.rstrip('.,:!?:') for word in document.lower().split() if word not in stoplist]
         for document in documents]

I have another question as well. I may run into words where I want to keep the word itself but drop the trailing digits or special characters.

agency[15]
assignment[72],
you&#8217;ll
america&#8217;s

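To clean up this mess, I was thinking I should keep removing characters from the end of a string until only letters (a-zA-Z) remain, or throw the string away if it contains more special characters than letters. In my last two examples, though, the string ends with a letter. In those cases I should just ignore the word, because the special characters outnumber the letters. I think I should only inspect the end of each string, because I'd like to keep hyphenated words intact where possible.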

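In short, I want to strip all trailing punctuation from each word, and I may need a helper routine for the cases I just described. I'm not sure how to do that, or whether it is even the best approach.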

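Tags: regular expressions, string processing, string manipulation, special characters, text cleaning, data preprocessing, punctuation, word lists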

2 Answers

You could try re.findall (or re.finditer) with a pattern like [a-z]+:

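import re
word_re = re.compile(r'[a-z]+')

# Variant 1: finditer yields match objects; group(0) is the matched word.
texts = [[match.group(0) for match in word_re.finditer(document.lower()) if match.group(0) not in stoplist]
          for document in documents]

# Variant 2: findall returns the matched strings directly and gives the same result.
texts = [[word for word in word_re.findall(document.lower()) if word not in stoplist]
          for document in documents]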

That way you can easily adjust the regex to capture exactly the words you want. Another option is re.split:

import re
word_re = re.compile(r'[^a-z]+')
texts = [[word for word in word_re.split(document.lower()) if word and word not in stoplist]
          for document in documents]
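
For a quick check, here is a minimal, self-contained sketch that runs the findall variant on made-up data. The `documents` and `stoplist` values are hypothetical, and the pattern `[a-z]+(?:-[a-z]+)*` is my own variation on `[a-z]+`, chosen on the assumption that you want hyphenated words kept in one piece.

```python
import re

# Hypothetical sample data, only for illustration.
documents = ["The agency[15] gave a well-known assignment[72], today!"]
stoplist = {"the", "a"}

# One or more letters, optionally followed by hyphen-joined letter groups,
# so "well-known" stays together while "[15]" and "," are skipped.
word_re = re.compile(r"[a-z]+(?:-[a-z]+)*")

texts = [[word for word in word_re.findall(document.lower()) if word not in stoplist]
         for document in documents]
print(texts)
# [['agency', 'gave', 'well-known', 'assignment', 'today']]
```

Because the pattern only ever matches letters and inner hyphens, the stoplist test is applied to the already-cleaned word, which is what the rstrip attempts in the question were missing.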

Answered 2025-04-16 by Python大师

>>> a = ['agency[15]','assignment72,','you&#8217;11','america&#8217;s']
>>> import re
>>> b = re.compile(r'\w+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment72
you
america
>>> b = re.compile('[a-z]+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment
you
america
>>>
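
If you would rather stay with rstrip as in the question, the sketch below may help. Two things seem to trip up the original attempts: `rstrip()` with no argument removes only whitespace, which `split()` has already discarded, and the stoplist test ran on the word before it was stripped. The sample `document` and `stoplist` are hypothetical, and stripping all of `string.punctuation` plus `string.digits` is an assumption about which trailing characters you want gone.

```python
import string

# Characters to strip from the right end: ASCII punctuation plus digits.
TRAILING = string.punctuation + string.digits

# Hypothetical sample data, only for illustration.
stoplist = {"the"}
document = "The agency[15] said: well-known assignment[72], done!"

# Strip first, then filter, so the stoplist sees the cleaned word.
words = [word.rstrip(TRAILING) for word in document.lower().split()]
words = [word for word in words if word and word not in stoplist]
print(words)
# ['agency', 'said', 'well-known', 'assignment', 'done']
```

Since rstrip stops at the first character that is not in the set, the hyphen inside "well-known" is left alone.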

>>> a = "I-have-hyphens-yo!"
>>> re.findall('[a-z]+',a)
['have', 'hyphens', 'yo']
>>> re.findall('[a-z-]+',a)
['-have-hyphens-yo']
>>> re.findall('[a-zA-Z-]+',a)
['I-have-hyphens-yo']
>>> re.findall(r'\w+',a)
['I', 'have', 'hyphens', 'yo']
>>>
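
None of the patterns above implements the rule described in the question (strip from the end until a letter is reached, and drop the word when special characters outnumber letters). A rough sketch of that rule follows; `clean_word` and `TRAILING_JUNK` are hypothetical names of mine, and treating only a-z as letters is an assumption.

```python
import re

# A run of non-letter characters anchored at the end of the string.
TRAILING_JUNK = re.compile(r"[^a-z]+$")

def clean_word(word):
    """Return the cleaned word, or None if the word should be dropped."""
    stripped = TRAILING_JUNK.sub("", word.lower())
    letters = sum(1 for ch in stripped if "a" <= ch <= "z")
    others = len(stripped) - letters
    # Drop empty results and words where non-letters outnumber letters.
    if letters == 0 or others > letters:
        return None
    return stripped

words = ["agency[15]", "assignment[72],", "you&#8217;ll", "well-known."]
print([clean_word(w) for w in words])
# ['agency', 'assignment', None, 'well-known']
```

Note that under a strict count "america&#8217;s" has eight letters and seven other characters, so this check would keep it. If the entities come from HTML, decoding them first with the standard library's `html.unescape` may be a better fit than counting characters.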

Answered 2025-04-16 by Python大师
