Filtering foreign-language stop words from a text file

0 votes

3 answers

606 views

Asked 2025-04-18 18:38

I have a text file listing many film titles, some in English and some in other languages, one title per line:

Kein Pardon
Kein Platz für Gerold
Kein Sex ist auch keine Lösung
Keine Angst Liebling, ich pass schon auf
Keiner hat das Pferd geküsst
Keiner liebt mich
Keinohrhasen
Keiro's Cat
La Prima Donna
La Primeriza
La Prison De Saint-Clothaire
La Puppe
La Pájara
La Pérgola de las Flores

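I have also put together a list of common non-English "stop words", such as La, de, las and das, which I want to filter out of the text file. How can I read my text, filter out these words, and print the filtered result to a new text file in the original format? The output should look roughly like this: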

Kein Pardon
Kein Platz für Gerold
Kein Sex keine Lösung
Keine Angst Liebling, pass schon
Keiner hat Pferd geküsst
Keiner liebt mich
Keinohrhasen
Keiro's Cat
Prima Donna
Primeriza
Prison Saint-Clothaire
Puppe
Pájara
Pérgola Flores

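To be clear, I know I could use NLTK, which ships a more comprehensive stop-word list, but I am looking for a way to target only the handful of words I have chosen myself.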

file-handling text-processing data-cleaning nltk text-filtering stop-words custom-stop-words

3 Answers

First, read the file:

with open('file', 'r') as f:
    inText = f.read()

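You need a function that takes a string you do not want in the text. It can process the whole text in one go rather than line by line. Since you also want the text to be accessible globally, I suggest creating a class: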

class changeText( object ):
    def __init__(self, text):
        self.text = text
    def erase(self, badText):
        # str.replace returns a new string, so assign the result back
        self.text = self.text.replace(badText, '')

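However, replacing a word with nothing leaves double spaces behind, as well as line breaks followed by a space, so you need a method that cleans up the extra whitespace.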

    def cleanup(self):
        # str.replace returns a new string, so assign the result back
        self.text = self.text.replace('  ', ' ')
        self.text = self.text.replace('\n ', '\n')
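
A single pass of replacing two spaces with one still leaves a double space wherever three or more spaces sat together, and a space at the very start of the text is not touched. As a rough sketch of a more thorough cleanup (the helper name `collapse_spaces` is made up for illustration and is not part of the class above), a regular expression can collapse each run of spaces per line:

```python
import re

def collapse_spaces(text):
    """Collapse runs of spaces/tabs to one space and trim each line.

    Line breaks are preserved, so the one-title-per-line layout survives.
    """
    lines = text.split('\n')
    return '\n'.join(re.sub(r'[ \t]+', ' ', line).strip() for line in lines)

print(collapse_spaces(' Prima   Donna\n Puppe '))
```

A `cleanup` method could simply assign the result of such a helper back to `self.text`.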

Next, initialize the object:

textObj = changeText( inText )

Then loop over the list of unwanted words and clean up:

badWords = ['La', 'de', 'las', 'das']  # the stop words named in the question
for bw in badWords:
    textObj.erase(bw)
textObj.cleanup()

Finally, write the result to a file:

with open('newfile', 'w') as f:  # 'w', not 'r': the file must be opened for writing
    f.write(textObj.text)
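
One caveat with plain substring replacement is that it also removes the letters from inside longer words (for example, a lowercase 'la' would be cut out of 'Platz'). A minimal sketch of a token-based alternative follows, assuming titles are split on whitespace and compared case-insensitively. The function name `filter_title` and the sample set are illustrative, not part of the answer above.

```python
def filter_title(title, stop_words):
    """Drop whitespace-separated tokens that are stop words.

    Each token is compared in lowercase with surrounding punctuation
    stripped, so 'La' and 'la,' both count as the stop word 'la'.
    """
    kept = [w for w in title.split()
            if w.strip('.,;:!?').lower() not in stop_words]
    return ' '.join(kept)

# Assumed stop-word set, taken from the words named in the question.
stop_words = {'la', 'de', 'las', 'das'}

print(filter_title('La Pérgola de las Flores', stop_words))      # Pérgola Flores
print(filter_title('Keiner hat das Pferd geküsst', stop_words))  # Keiner hat Pferd geküsst
```

Because whole tokens are dropped and the rest are rejoined with single spaces, no separate whitespace cleanup is needed for this variant.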

Answered 2025-04-18 by Python大师

You can use the re module (https://docs.python.org/2/library/re.html#re.sub) to replace the strings you do not want with an empty string. Something like the following should work:

    import re
    #save your undesired text here. You can use a different data structure
    #  if the list is big and later build your match string like below
    unDesiredText = 'abc|bcd|vas'

    #set your inputFile and outputFile appropriately
    fhIn = open(inputFile, 'r')
    fhOut = open(outputFile, 'w')

    for line in fhIn:
        line = re.sub(unDesiredText, '', line)
        fhOut.write(line)

    fhIn.close()
    fhOut.close()
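
A bare alternation such as 'abc|bcd|vas' matches those letters anywhere, including inside longer words. The sketch below contrasts that with an alternation wrapped in `\b` word boundaries. The stop-word list and the helper name `strip_words` are assumptions made for illustration.

```python
import re

# Hypothetical stop-word list, taken from the words named in the question.
stop_words = ['la', 'de', 'las', 'das']

# Bare alternation: matches the letters anywhere, even inside longer words.
naive = re.compile('|'.join(stop_words), re.IGNORECASE)

# Escaped alternation wrapped in \b...\b: matches whole words only.
whole_word = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in stop_words) + r')\b',
    re.IGNORECASE,
)

def strip_words(prog, line):
    """Remove every match of prog, then collapse the leftover whitespace."""
    return ' '.join(prog.sub('', line).split())

title = 'La Pérgola de las Flores'
print(strip_words(naive, title))       # Pérgo s Flores
print(strip_words(whole_word, title))  # Pérgola Flores
```

`re.escape` only matters if a stop word contains regex metacharacters, but it is cheap insurance when the list comes from a file.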

Answered 2025-04-18 by Python大师

Another approach, in case you are interested in exception handling and other related details:

import re

stop_words = ['de', 'la', 'el']
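# \b...\b restricts matches to whole words, so 'la' is not cut out of 'Platz'
pattern = r'\b(?:%s)\b' % '|'.join(map(re.escape, stop_words))
prog = re.compile(pattern, re.IGNORECASE)  # re.IGNORECASE to catch both 'La' and 'la'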

input_file_location = 'in.txt'
output_file_location = 'out.txt'

with open(input_file_location, 'r') as fin:
    with open(output_file_location, 'w') as fout:
        for l in fin:
            m = prog.sub('', l.strip())  # l.strip() to remove leading/trailing whitespace
            m = re.sub(' +', ' ', m)  # suppress multiple white spaces
            fout.write('%s\n' % m.strip())
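
Titles with accented characters such as 'für' or 'Pérgola' only survive the round trip if the file is read and written with the encoding it was actually saved in. The following is a sketch under the assumption of Python 3 and a UTF-8 file. The function name `filter_file` and the `encoding` default are illustrative, so swap in 'latin-1' or another codec if that is what the file uses.

```python
import re

def filter_file(src, dst, stop_words, encoding='utf-8'):
    """Copy src to dst line by line, removing whole-word stop words.

    Matching is case-insensitive; leftover runs of whitespace are
    collapsed so each title stays on its own tidy line.
    """
    prog = re.compile(
        r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b',
        re.IGNORECASE,
    )
    with open(src, 'r', encoding=encoding) as fin, \
         open(dst, 'w', encoding=encoding) as fout:
        for line in fin:
            cleaned = ' '.join(prog.sub('', line).split())
            fout.write(cleaned + '\n')
```

A call might look like `filter_file('in.txt', 'out.txt', ['de', 'la', 'el'])`, reusing the file names from the code above.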

Answered 2025-04-18 by Python大师
