Searching a file for lines that contain list items, using Python

Asked 2025-04-18 17:10

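I have a text file containing thousands of lines of ASCII text, and a list of a few hundred keywords that I want to search for line by line. To start with, I want to print every line that matches (to the screen or to a file), but eventually I'd like to sort the returned lines by the number of matches.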

So my keyword list looks something like this:

keywords = ['one', 'two', 'three']

The approach I had in mind was this:

myfile = open('file.txt')
for line in myfile:
    if keywords in line:
        print line

But I haven't managed to turn this idea into working code.

I also considered using a regular expression:

print re.findall(keywords, myfile.read())

But that led to a different set of errors and problems.

I would be very grateful for any guidance, syntax, or code snippets.


3 Answers

Python's collections module has a class called Counter that looks well suited to this problem. Here is how I would do it.

from collections import Counter

keywords = ['one', 'two', 'three']
lines = ['without any keywords', 'with one', 'with one and two']

matches = []
for line in lines: 
    # Takes all the words in the line and gets the number of times 
    # they appear as a dictionary-like Counter object.
    words = Counter(line.split())

    line_matches = 0
    for kw in keywords:
        # Get the number of times it popped up in the line
        occurrences = words.get(kw, 0)
        line_matches += occurrences

    matches.append((line, line_matches))

# Sort by the number of occurrences per line, descending.
print(sorted(matches, key=lambda x: x[1], reverse=True))

Running this prints:

[('with one and two', 2), ('with one', 1), ('without any keywords', 0)]
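The Counter approach above splits on whitespace, so a keyword followed by punctuation or written in a different case would not be counted. The following is a minimal sketch of one way to handle that, not part of the original answer: the helper name count_keyword_hits and the tokenizing pattern are assumptions chosen for illustration, and lines with no matches are dropped before sorting.

```python
import re
from collections import Counter

def count_keyword_hits(line, keywords):
    """Count whole-word keyword occurrences, ignoring case and punctuation."""
    # Assumed tokenization: lowercase the line, then take runs of
    # letters, digits, and apostrophes as words.
    words = Counter(re.findall(r"[a-z0-9']+", line.lower()))
    # A Counter returns 0 for missing keys, so no .get() is needed.
    return sum(words[kw] for kw in keywords)

keywords = ['one', 'two', 'three']
lines = ['One, two; one!', 'nothing here', 'Three.']

scored = [(line, count_keyword_hits(line, keywords)) for line in lines]
# Keep only lines with at least one match, highest count first.
ranked = sorted((pair for pair in scored if pair[1]),
                key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Keywords are assumed to be lowercase here; if they are not, they would need to be lowercased as well before the lookup.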

Answered 2025-04-18 by Python大师


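You didn't say so in the question, but I assume that a keyword appearing several times should count only once toward the score (which favors lines that contain more distinct keywords):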

def getmatching(lines, keywords):
    result = []
    keywords = set(keywords)
    for line in lines:
        matches = len(keywords & set(line.split()))
        if matches:
            result.append((matches, line))
    return (line for matches, line in sorted(result, reverse=True))

Example

lines = ['no keywords here', 'one keyword here',
         'two keywords in this one line', 'three minus two equals one',
         'one counts only one time because it is only one keyword']

keywords = ['one', 'two', 'three']

for line in getmatching(lines, keywords):
    print(line)

Output

three minus two equals one
two keywords in this one line
one keyword here
one counts only one time because it is only one keyword
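To make the scoring difference concrete, here is a small sketch comparing the distinct-keyword score used above with a plain occurrence count for the last example line. The variable names distinct and total are introduced only for this illustration.

```python
keywords = {'one', 'two', 'three'}
line = 'one counts only one time because it is only one keyword'
words = line.split()

# Set intersection: each keyword contributes at most 1, however often it appears.
distinct = len(keywords & set(words))

# Plain count: every occurrence of a keyword contributes 1.
total = sum(1 for word in words if word in keywords)

print(distinct, total)
```

With the distinct score this line ranks alongside 'one keyword here'; with the plain count it would rank above lines that mention two different keywords once each.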

Answered 2025-04-18 by Python大师


You can't directly check whether a list is in a string. What you can check is whether one string contains another string.

lines = ['this is a line without any keywords', 
         'this is a line with one', 
         'this is a line with one and two',
         'this is a line with three']
keywords = ['one', 'two', 'three']

for line in lines:
    for word in keywords:
        if word in line:
            print(line)
            break

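The break inside the keyword loop is necessary: it exits the loop as soon as the first matching keyword is found. Without it, the line would be printed once for every keyword that matches.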
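The loop-and-break pattern can also be written with the built-in any(), which stops at the first keyword that matches. This is an equivalent sketch rather than the answer's own code; the sample lines are made up, and the last one shows that the in operator matches substrings, not whole words.

```python
lines = ['no keywords', 'with one', 'with one and two', 'phone call']
keywords = ['one', 'two', 'three']

# any() short-circuits on the first keyword found, like the break above.
matching = [line for line in lines
            if any(word in line for word in keywords)]

for line in matching:
    # 'phone call' is printed too, because 'one' is a substring of 'phone'.
    print(line)
```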

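The regex solution has the same problem. You can either use the approach above and add another loop over the keywords, or build a regular expression that matches any of the words. See the documentation on Python's regular expression syntax.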

import re

for line in lines:
    # findall returns every non-overlapping match of any of the three words.
    matches = re.findall('one|two|three', line)
    if matches:
        print(line, len(matches))

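Note that re.findall returns an empty list when there are no matches, and a list of all matches otherwise. So we can test the result directly in the if condition, because an empty list is treated as False.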

For simple cases like this, you can also generate the regex pattern easily:

pattern = '|'.join(keywords)
print(pattern)
# 'one|two|three'
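A plain '|'.join works for simple alphabetic keywords. If a keyword may contain characters that are special in regular expressions, or if partial-word hits such as 'one' inside 'money' are unwanted, one possible refinement is sketched below. It assumes whole-word matching is what is wanted, which the question does not state, and the sample keyword 'c++' is made up for illustration.

```python
import re

keywords = ['one', 'two', 'c++']

# re.escape neutralizes regex metacharacters such as '+' in each keyword.
alternatives = '|'.join(re.escape(kw) for kw in keywords)

# The lookarounds require that no word character touches the match on
# either side, so 'one' does not match inside 'phone' or 'money'.
pattern = re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)')

found = pattern.findall('one phone, two c++ money')
print(found)
```

Lookarounds are used instead of \b here because \b does not behave as a word edge next to non-word characters like '+'.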

To sort them, just collect them in a list of tuples and use the key argument of sorted.

results = []
for line in lines:
    matches = re.findall('one|two|three', line)
    if matches:
        results.append((line, len(matches)))

results = sorted(results, key=lambda x: x[1], reverse=True)

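You can read the documentation for sorted, but in short, the key argument supplies a function that computes the value to sort on. Here we extract the second element of each tuple, which is where we stored the number of matches for that line, and sort on that.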
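As a small illustration of how key behaves, the sketch below sorts made-up (line, count) tuples using operator.itemgetter, which does the same job as the lambda. Because sorted is stable, even with reverse=True, lines with equal counts keep their original relative order.

```python
from operator import itemgetter

# Made-up sample data: (line, number of matches).
results = [('line a', 1), ('line b', 3), ('line c', 1), ('line d', 2)]

# itemgetter(1) is equivalent to lambda x: x[1].
by_count = sorted(results, key=itemgetter(1), reverse=True)
print(by_count)
```

Here 'line a' still comes before 'line c' in the result, since both have one match and 'line a' appeared first in the input.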

This is how you would apply it to a real file and save the results.

import re

keywords = ['one', 'two', 'three']
pattern = '|'.join(keywords)

results = []
with open('myfile.txt', 'r') as f:
    for line in f:
        matches = re.findall(pattern, line)
        if matches:
            # Strip the trailing newline so the output has no blank lines.
            results.append((line.rstrip('\n'), len(matches)))

# Sort by match count, highest first.
results = sorted(results, key=lambda x: x[1], reverse=True)

with open('results.txt', 'w') as f:
    for line, num_matches in results:
        f.write('{}  {}\n'.format(num_matches, line))

You can read up on the with context manager, but here it basically ensures that the file is closed when you are done with it.
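A quick way to see what with does is to check the file object's closed attribute after the block ends. The sketch below writes to a temporary directory so it does not touch any real files; the file name demo.txt is arbitrary.

```python
import os
import tempfile

# Arbitrary file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('with one\n')

# The file was closed automatically when the with block ended.
closed_after_block = f.closed

with open(path) as f:
    content = f.read()

print(closed_after_block, repr(content))
```

The file is closed even if an exception is raised inside the block, which is the main reason to prefer with over a manual close().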

Answered 2025-04-18 by Python大师

