Python: loop over a file's lines; if a line equals a line in another file, return the original line

Asked 2025-04-17 00:28

Text file 1 has the following format:

'WORD': 1
'MULTIPLE WORDS': 1
'WORD': 2

and so on.

That is, each word (or phrase) is followed by a colon and a number.

Text file 2 has the following format:

'WORD'
'WORD'

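and so on.

I need to extract the single words from file 1 (that is, only entries consisting of one WORD, not multiple words) and, if the word also appears in file 2, return the word from file 1 together with its number.

I have some code that doesn't work very well: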

import re

def GetCounts(file1, file2):
    target_contents = open(file1).readlines()    # file 1 as list --> 'WORD': n
    match_me_contents = open(file2).readlines()  # file 2 as list --> 'WORD'
    ls_stripped = [x.strip('\n') for x in match_me_contents]  # get rid of newlines

    match_me_as_regex = re.compile("|".join(ls_stripped))

    for line in target_contents:
        first_column = line.split(':')[0]  # get the first item in line.split
        number = line.split(':')[1]        # get the number associated with the word
        if len(first_column.split()) == 1:  # get single word, no multiple words
            # Does the word from target contents match the word
            # from match_me contents?  If so, return the line from
            # target_contents
            if re.findall(match_me_as_regex, first_column):
                print(first_column, number)

# OUTPUT: WORD, n
#         WORD, n
#         etc.

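Because it relies on a regular expression, the output is unreliable. For example, the code returns 'asset, 2' because re.findall() also matches the 'set' from 'match_me'. I need to match the target word against the whole words in 'match_me' so that partial matches no longer produce wrong output.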
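
The partial-match behaviour described above can be reproduced in a few lines. This is only an illustrative sketch: the sample words in `match_words` are hypothetical stand-ins for the contents of file 2, with the quotes left out for brevity.

```python
import re

# Hypothetical sample words standing in for file 2.
match_words = ["set", "word"]
pattern = re.compile("|".join(match_words))

# findall searches anywhere in the string, so "set" is found inside "asset".
partial_hits = re.findall(pattern, "asset")

# An exact comparison only succeeds when the whole word is identical.
exact_hit = "asset" in match_words

print(partial_hits, exact_hit)
```

The unanchored alternation reports a hit for "asset", while the exact comparison does not, which is the difference the question is asking about.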


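8 Answers

I'm guessing that by "doesn't work very well" you mean it is slow? I tested it and it seems to work correctly.

You can make it more efficient by putting the words from file 2 into a set: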

word_set = set(ls_stripped)

Then, instead of using findall, you can simply check whether the word is in the set:

in_set = just_word in word_set

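This feels simpler and cleaner than using a regular expression. Set membership also compares whole strings, so a word such as 'set' cannot match inside 'asset'.

Answered 2025-04-17 by Python大师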
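
As a minimal sketch, the two one-liners above might be combined as follows, assuming the file formats shown in the question. The function name `get_counts` and the in-memory sample lists are hypothetical, chosen so the example runs without any files on disk; in practice the lists would come from reading file 1 and file 2.

```python
def get_counts(target_lines, match_lines):
    """Return (word, number) pairs for single-word entries found in match_lines.

    target_lines: iterable of lines like "'WORD': 1"
    match_lines:  iterable of lines like "'WORD'"
    """
    # Build the lookup set once; membership tests compare whole strings,
    # so 'set' can never match inside 'asset'.
    word_set = {line.strip() for line in match_lines if line.strip()}

    results = []
    for line in target_lines:
        if ':' not in line:
            continue  # skip blank or malformed lines
        first_column, number = line.rsplit(':', 1)
        just_word = first_column.strip()
        # Keep single words only, then require an exact match in the set.
        if len(just_word.split()) == 1 and just_word in word_set:
            results.append((just_word, number.strip()))
    return results


# Hypothetical sample data in the formats described in the question.
file1_lines = ["'asset': 2\n", "'set': 5\n", "'data set': 1\n"]
file2_lines = ["'set'\n", "'word'\n"]
print(get_counts(file1_lines, file2_lines))
```

Because both files keep their quotes here, the comparison is made on the quoted strings. If only one file were quoted, the quotes would need to be stripped from that side before comparing.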

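This looks like it may just be a special case of grep. If file2 is essentially a list of patterns and the output should have the same format as file1, you can do this: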

grep -wf file2 file1

Here -w tells grep to match whole words only, and -f file2 reads the patterns from file2, one per line.

Answered 2025-04-17 by Python大师
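
For readers who prefer to stay in Python, whole-word matching similar in spirit to -w can be sketched with `re.escape` and `\b` word boundaries. This is an approximation rather than a reimplementation of grep: the helper name `whole_word_lines` is hypothetical, and the sketch assumes the patterns have already had their surrounding quotes removed.

```python
import re

def whole_word_lines(patterns, lines):
    """Return the lines that contain any pattern as a whole word."""
    if not patterns:
        return []
    # re.escape keeps characters such as '.' or '+' literal;
    # \b requires a word boundary on both sides of the match.
    alternation = "|".join(re.escape(p) for p in patterns)
    regex = re.compile(r"\b(?:" + alternation + r")\b")
    return [line for line in lines if regex.search(line)]


# Hypothetical sample data; patterns are given without quotes.
file1_lines = ["'asset': 2", "'set': 5", "'data set': 1"]
matches = whole_word_lines(["set"], file1_lines)
print(matches)
```

Note that the multi-word entry 'data set' is still returned, because "set" appears in it as a whole word. A whole-word search alone therefore does not enforce the single-word requirement from the question, and that check would have to be added separately.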

If file2 is not especially large, you can read its contents into a set in one go:

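# Strip the quotes from file2's words too, so both sides compare unquoted.
file2 = set(word.strip("'") for word in open("file2").read().split())
for line in open("file1"):
    if line.split(":")[0].strip("'") in file2:
        print(line, end="")  # line already ends with a newline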

Answered 2025-04-17 by Python大师
