Python: Autocorrect

1 vote

1 answer

2681 views

Asked 2025-04-18 04:13

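I have two files, check.txt and orig.txt. I want to check every word in check.txt and see whether it matches any word in orig.txt. If it matches, the code should replace the word with the first word it matched; if it does not, the word should be left unchanged. However, the code does not work as expected. Please help.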

check.txt looks like this:

ukrain

troop

force

and orig.txt looks like this:

ukraine cnn should stop pretending &amp; announce: we will not report news while it reflects bad on obama @bostonglobe @crowleycnn @hardball

rt @cbcnews: breaking: .@vice journalist @simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou

russia 'outraged' at deadly shootout in east #ukraine -  moscow:... http://t.co/nqim7uk7zg
 #groundtroops #russianpresidentvladimirputin

http://pastebin.com/XJeDhY3G

f = open('check.txt','r')
orig = open('orig.txt','r')
new = open('newfile.txt','w')

for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()            
            if word in word2:
                word = word2
            else:
                print('not found')
        new.write(word)


1 Answer

Your code has two problems:

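When you iterate over the words in f, each word still ends with a newline character, so your in check never succeeds.
You want to loop over orig for every word in f, but a file is an iterator and is exhausted after it has been read once.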
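Both problems can be reproduced in isolation. The following is a small sketch only: io.StringIO stands in for the real files (an assumption made for the demo), and the sample text is a shortened version of the data in the question.

```python
import io

# io.StringIO objects stand in for the real files (assumption for this demo).
check = io.StringIO("ukrain\ntroop\n")
orig = io.StringIO("ukraine cnn\n#groundtroops\n")

# Problem 1: lines read from a file keep their trailing newline.
first = next(check)
print(repr(first))                 # 'ukrain\n'
print(first in "ukraine")          # False, because of the newline
print(first.strip() in "ukraine")  # True once the newline is stripped

# Problem 2: a file-like iterator can only be consumed once.
first_pass = list(orig)            # reads both lines
second_pass = list(orig)           # nothing left to read
print(len(first_pass), len(second_pass))  # 2 0
```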

You can fix these by calling word = word.strip() to remove the newline and orig = list(orig) to turn orig into a list, or you can try the following approach:

# get all stemmed words
stemmed = [line.strip() for line in f]
# set of lowercased original words
original = set(word.lower() for line in orig for word in line.split())
# map stemmed words to unstemmed words
unstemmed = {word: None for word in stemmed}
# find original words for word stems in map
for stem in unstemmed:
    for word in original:
        if stem in word:
            unstemmed[stem] = word
print(unstemmed)

Or, more concisely (without the final double loop), you can use difflib, as mentioned in the comments:

import difflib
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
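As a rough illustration of the difflib variant, here is a self-contained sketch. The sample words come from the question's files, and the helper name closest_word is hypothetical. difflib.get_close_matches returns a list (possibly empty), so the sketch falls back to the stem itself when nothing clears the default similarity cutoff of 0.6.

```python
import difflib

def closest_word(stem, candidates):
    """Return the closest candidate to stem, or stem itself if none is close enough."""
    # n=1 asks for at most one match; the result is still a list.
    matches = difflib.get_close_matches(stem, candidates, n=1)
    return matches[0] if matches else stem

# Sample data taken from the question (shortened).
original = ["ukraine", "cnn", "russia", "moscow", "#groundtroops"]
stemmed = ["ukrain", "troop", "force"]

unstemmed = {stem: closest_word(stem, original) for stem in stemmed}
print(unstemmed)
```

Fuzzy matching is not the same as the substring test used earlier: with the default cutoff, troop is not matched to #groundtroops, because the difference in length pulls the similarity ratio below 0.6. If substring behaviour is what you need, the explicit loop may be the better fit.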

Also, remember to close your files, or use the with keyword, which closes them automatically.
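Putting the pieces together, a minimal sketch of the original loop with the newline stripped, orig read into a list, and with handling the files might look like this. The function name autocorrect and its parameters are hypothetical, and writing one word per output line is an assumption about the desired format.

```python
def autocorrect(check_path, orig_path, out_path):
    """Replace each word in check_path with the first word in orig_path containing it."""
    # Read orig once into a list of lowercased words; unlike the file object,
    # the list can be scanned again for every word in check.
    with open(orig_path) as orig:
        candidates = [w.lower() for line in orig for w in line.split()]

    # Both files are closed automatically when the with block ends.
    with open(check_path) as check, open(out_path, "w") as out:
        for line in check:
            word = line.strip()          # drop the trailing newline
            if not word:
                continue                 # skip blank lines
            # First candidate that contains the word, else the word unchanged.
            match = next((c for c in candidates if word in c), word)
            out.write(match + "\n")
```

With the files from the question, calling autocorrect('check.txt', 'orig.txt', 'newfile.txt') should write ukraine, #groundtroops and force, since force does not occur in any word of orig.txt.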

Answered 2025-04-18 by Python大师
