How do I remove punctuation?

Asked 2025-04-18 04:24 by 数据工程师

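I am using the NLTK library in Python for tokenization.

There are already plenty of answers on the forum about removing punctuation. However, none of them handles all of the following issues together: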

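1. More than one symbol in a row. Take the sentence: He said,"that's it." Because the comma is followed by a quotation mark, the tokenizer does not remove the ." at the end of the sentence. It returns ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He', 'said', 'that', 's', 'it']. Similar cases include '...', '--', '!?', ',"', and so on.
2. Symbols at the end of a sentence. Take the sentence: Hello World. The tokenizer returns ['Hello', 'World.'] instead of ['Hello', 'World']. Note the period after 'World'. Similar cases include '--' and ',', which can appear at the beginning, middle, or end of any word.
3. Characters with symbols both before and after them, such as '*u*', '''', '""'.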

Is there an elegant way to solve all three problems?

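Tags: text processing, natural language processing, nltk, data preprocessing, language models, text cleaning, tokenization, punctuation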

2 Answers

Solution 1: Tokenize the text first, then strip the punctuation from the tokens.

>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''
>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', 's', 'it']

Solution 2: Remove the punctuation first, then tokenize the text.

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> sent = '''He said,"that's it."'''
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
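The same idea as Solution 2 can probably be written more compactly with `str.translate`. A minimal sketch, assuming only ASCII punctuation matters; `remove_punct_then_split` is a hypothetical name used for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_punct_then_split(text):
    """Replace punctuation with spaces, then split on whitespace.

    Hypothetical helper for illustration.
    """
    return text.translate(_PUNCT_TO_SPACE).split()


sent = '''He said,"that's it."'''
print(remove_punct_then_split(sent))
```

Replacing punctuation with a space rather than deleting it keeps said,"that from collapsing into a single token.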

Answered 2025-04-18 by Python大师

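If you want to tokenize the string in a single pass, I think your only option is nltk.tokenize.RegexpTokenizer. The approach below uses punctuation as a marker to remove the letters it surrounds (your third requirement) before removing the punctuation itself. In other words, it removes *u* first and then strips all remaining punctuation.

One way to do this is to tokenize on gaps, so that everything the pattern matches is treated as a separator: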

>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
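>>> # Non-capturing groups: with gaps=True the pattern is used to split the text, and capturing groups would leak separators into the result.
>>> toker = RegexpTokenizer(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+', gaps=True)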
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World']  # omits *u* per your third requirement

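This should satisfy all three of the requirements you listed. Note, however, that this tokenizer will not return a token for a single letter wrapped in punctuation, such as "A". The rule only applies to single letters that have punctuation on both sides, so a longer word such as "Go." still produces a token. You may need to adjust the regular expression further, depending on your data and what you expect from it.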
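To check these edge cases without installing NLTK, the gap-splitting behaviour can be approximated with the standard `re` module. This is a sketch, not the tokenizer's actual implementation: `tokenize_on_gaps` is a hypothetical helper, and the pattern is written with non-capturing groups so that `re.split` does not return the separators.

```python
import re

# Separator pattern: a run of non-word characters, which may also swallow a
# single word character that has punctuation immediately on both sides.
GAP = re.compile(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+')


def tokenize_on_gaps(text):
    """Split on separator matches and discard empty strings.

    Hypothetical helper for illustration.
    """
    return [tok for tok in GAP.split(text) if tok]


s = '''He said,"that's it." *u* Hello, World.'''
print(tokenize_on_gaps(s))      # *u* is dropped
print(tokenize_on_gaps('"A"'))  # single letter wrapped in punctuation
print(tokenize_on_gaps('Go.'))  # multi-letter word is kept
```

Keep in mind that `\w` also matches digits and the underscore, so a quoted single digit is dropped as well and `_` is not treated as punctuation.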

Answered 2025-04-18 by Python大师
