Strings in a file do not match strings in a set

Asked 2025-04-18 10:17

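I have a file with one word per line, and a set of words. I want to append to the file every word from the set that is not already in it. Here is part of my code: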

def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    fin = open(self.finalFile,"r")
    out = set()
    for line in self.lines_seen: #lines_seen is a set with words
        if line not in fin:
            out.add(line)
        else:
            print line
    fin.close()
    fout= open(self.finalFile,"a+")
    for line in out:
        fout.write(line)

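But it only matches some of the words that really are equal. Using the same word dictionary, every run adds duplicate words to the file. What am I doing wrong? What is happening? I have tried comparing with both '==' and 'is', and the result is the same.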

Edit 1: I am working with very large files (finalFile) that cannot be fully loaded into memory, so I think I should read the file line by line.

Edit 2: Found a big problem, related to the file pointer:

def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    out = set()
    out.clear()
    with open(self.finalFile,"r") as fin:
        for word in self.lines_seen:
            fin.seek(0, 0)  # with this line the speed drops to 40 lines/second; without it, it doesn't work
            if word in fin:
                self.totalmatches = self.totalmatches+1
            else:
                out.add(word)
                self.totalLines=self.totalLines+1


    fout= open(self.finalFile,"a+")
    for line in out:
        fout.write(line)

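If I put the loop over lines_seen before opening the file, I open the file once for every line in lines_seen, but the speed is only 30,000 lines per second. Using set(), I get 200,000 lines per second at worst, so I think I will load the file in batches and compare using sets. Is there a better solution?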

Edit 3: Done!

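Tags: set operations, performance optimization, data structures, memory management, string comparison, file handling, line reading, duplicate data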

2 Answers

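Your comparison probably fails because every line read from a file ends with a newline character, so you are actually comparing 'word\n' with 'word'. Using 'rstrip' removes the trailing newline: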

>>> foo = 'hello\n'
>>> foo
'hello\n'
>>> foo.rstrip()
'hello'
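The same effect shows up in a set membership test. The following is a minimal sketch in Python 3 syntax; the `words` set is a hypothetical stand-in for lines_seen, and it assumes the set holds bare words without newlines:

```python
# Lines read from a file keep their trailing newline, so they do not
# compare equal to the bare words stored in a set.
words = {"hello", "world"}      # hypothetical stand-in for lines_seen
raw_line = "hello\n"            # what iterating over a file yields

found_raw = raw_line in words                      # compares 'hello\n'
found_stripped = raw_line.rstrip("\n") in words    # compares 'hello'

print(found_raw, found_stripped)  # False True
```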

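I suggest iterating over the file directly instead of iterating over the variable that holds the words you want to check. If I understand your code correctly, you want to write each item in self.lines_seen to self.finalFile as long as it is not already there. If you use 'if line not in fin' the way you do now, the result may not be what you expect. For example, if your file contains: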

lineone
linetwo
linethree

and the unordered lines_seen set yields 'linethree' first and then 'linetwo', then 'linethree' will be matched but 'linetwo' will not, because the file object has already read past 'linetwo':

with open(self.finalFile,"r") as fin:
    for line in self.lines_seen:
        if line not in fin:
            print line
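The consumed-iterator behaviour can be reproduced in isolation. This is a small sketch in Python 3 syntax; `io.StringIO` is used as a stand-in for an open text file (an assumption made only to keep the example self-contained), and the lines are compared with their newline included:

```python
import io

# io.StringIO stands in for an open text file; `in` has no index to use,
# so it simply iterates forward from the current position.
fin = io.StringIO("lineone\nlinetwo\nlinethree\n")

first = "linethree\n" in fin    # scans forward until the line is found
second = "linetwo\n" in fin     # the iterator is already at the end
fin.seek(0)                     # rewind to the start of the "file"
third = "linetwo\n" in fin      # found again after the rewind

print(first, second, third)  # True False True
```

This is also why adding fin.seek(0, 0) makes the test work but slows it down: each lookup rescans the file from the beginning.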

Instead, consider using a counter:

from collections import Counter
linecount = Counter()
# using 'with' means you don't have to worry about closing it once the block ends
with open(self.finalFile,"r") as fin:
    for line in fin:
        line = line.rstrip() # remove the right-most whitespace/newline
        linecount[line] += 1
for word in self.lines_seen:
    if word not in linecount:
        out.add(word)

Answered 2025-04-18 by Python大师

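fin is a file handle, so 'if line not in fin' does not do what you expect: the membership test iterates over the file and consumes it as it goes. You need to read the file's contents first.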

with open(self.finalFile, "r") as fh:
    fin = fh.read().splitlines()   # fin is now a list of words from finalFile

for line in self.lines_seen: #lines_seen is a set with words
    if line not in fin:
        out.add(line)
    else:
        print line
# remove fin.close()

Edit:

Since lines_seen is a set, why not build a new set from the words in finalFile and then take the difference between the two sets?

new_set = set()

with open(self.finalFile, "r") as fh:
    for f_line in fh:
        new_set.add(f_line.strip())

# This will give you all the words in finalFile that are not in lines_seen.
print new_set.difference(self.lines_seen)
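Note that the difference above lists the words in finalFile that are missing from lines_seen. For the question's goal, the opposite direction is needed, and if finalFile is too large to hold in memory, it may be possible to avoid building a set from the file at all. The sketch below (Python 3 syntax) assumes lines_seen itself fits in memory; `missing_words` is a hypothetical helper, and `io.StringIO` stands in for the open file:

```python
import io


def missing_words(lines, wanted):
    """Return the words in `wanted` that never appear in `lines`.

    `lines` is any iterable of text lines, such as an open file.
    `wanted` is an iterable of bare words (no trailing newline).
    """
    remaining = set(wanted)          # copy, so the caller's set is untouched
    for line in lines:               # single pass, one line in memory at a time
        remaining.discard(line.rstrip("\n"))
        if not remaining:            # every word was found, stop early
            break
    return remaining


fake_file = io.StringIO("lineone\nlinetwo\n")
result = missing_words(fake_file, {"linetwo", "linethree"})
print(result)  # {'linethree'}
```

The file is read exactly once, and memory use is bounded by the size of the word set rather than the size of the file.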

Answered 2025-04-18 by Python大师
