Best way to match a large list against strings in Python

2 votes

2 answers

1779 views

Asked 2025-04-16 18:13

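I have a Python list of about 700 terms that I want to use as metadata for entries in a Django database. I want to match the terms in the list against each entry's description to see which ones appear, but I have run into a few problems. The first is that some list items are multi-word phrases that contain other items from the list. For example: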

Intrusion
Intrusion Detection

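I am not getting far with re.findall, because it matches both "Intrusion" and "Intrusion Detection". I only want to match "Intrusion Detection", not "Intrusion".

Is there a better way to do this kind of matching? I thought about trying NLTK, but it does not look like it helps with this kind of matching.

Edit:

To be clearer: I have a list of 700 terms such as "Firewall" or "Intrusion Detection". I want to match these terms against the descriptions stored in my database to see which ones appear, and then use the matching terms as metadata. So if I have the following string: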

There are many types of intrusion detection devices in production today.

and a list containing the following terms:

Intrusion
Intrusion Detection

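I want to match "Intrusion Detection" but not "Intrusion". Ideally I would also like to match both singular and plural forms, though I may be getting ahead of myself. The overall idea is to collect all the matches into a list and then process them.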

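Tags: regular expressions, string matching, natural language processing, term matching, database entries, metadata processing, phrase matching, singular/plural matching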

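2 Answers

The question is a little unclear, but as I understand it you have a master list of terms, one term per line. You also have some test data, some of which will be in the master list and some of which will not. You want to check whether each piece of test data is in the master list and, if it is, perform some task.

Suppose your master list looks like this:

Intrusion Detection
Firewall
FooBar

and your test data looks like this:

Intrusion
Intrusion Detection
foo
bar

This simple script should point you in the right direction: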

#!/usr/bin/env python3

import sys


def main():
  '''usage: tester.py masterList testList'''

  # build the master list
  # .strip() removes the trailing '\n'
  # .lower() makes the comparison case-insensitive,
  # so "Intrusion" and "intrusion" are treated as equal
  with open(sys.argv[1], 'r') as masterListFile:
    masterList = [line.strip().lower() for line in masterListFile]

  # run the test; the with blocks close both files automatically
  with open(sys.argv[2], 'r') as testListFile:
    for line in testListFile:
      term = line.strip().lower()
      if term in masterList:
        print(term, "in master list!")
        # perhaps grab your metadata using a like %%
      else:
        print("OH NO!", term, "not found!")


if __name__ == '__main__':
  main()

Sample output

OH NO! intrusion not found!
intrusion detection in master list!
OH NO! foo not found!
OH NO! bar not found!

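There are several other ways to do this, but this should point you in the right direction. If your list is large (700 really is not that large), consider using a dictionary; I think that would be faster, especially if you plan to query the database, because a dictionary lookup is a hash lookup whereas "in" on a list scans every element. The dictionary could be structured as {term: information about the term}.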
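As a rough illustration of the dictionary idea, here is a minimal sketch. The name term_info, the lookup helper, and the "category" values are all made up for the example; in practice the values would be whatever information you keep for each term.

```python
# Hypothetical term -> info mapping; keys are stored lower-cased,
# and the values are placeholders for illustration only.
term_info = {
    "intrusion detection": {"category": "security"},
    "firewall": {"category": "network"},
    "foobar": {"category": "misc"},
}


def lookup(term):
    """Return the info stored for term, or None if it is not a known term."""
    # normalize the same way the keys were normalized
    return term_info.get(term.strip().lower())


for candidate in ["Intrusion", "Intrusion Detection", "foo"]:
    info = lookup(candidate)
    print(candidate, "->", info if info is not None else "not found")
```

Using .get() avoids a KeyError for unknown terms, and the lookup cost stays roughly constant as the list grows.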

Answered 2025-04-16 by Python大师

If you need a more flexible way to match against the entries' descriptions, you can combine nltk and re.

from nltk.stem import PorterStemmer
import re

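Suppose you have different descriptions that refer to the same event, for example a rewrite of the system. You can use nltk.stem to catch rewrite, rewriting, rewrites and other forms, including singular and plural.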

master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls'
]

terms = [
    'Intrusion Detection',
    'Approved rewrite',
    'Firewall'
]

stemmer = PorterStemmer()

# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)

# add 'match anything after it' expression to each of the stemmed words
# join result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]
print(regex_patterns)
print('')

for sentence in master_list:
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)

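Output (recent NLTK releases lowercase the stems by default, so the patterns may print in lowercase; the matches are unaffected because of re.IGNORECASE):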

['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']

['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']
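One caveat with the '.*' patterns is that they are greedy and allow any amount of text between the stems, so the two words of a term may match even when they sit far apart in a sentence. If that is too loose, a stricter variant can require each stem to start a word and the words to be adjacent. The sketch below is only an illustration: tight_pattern is a hypothetical helper, and the stems are hard-coded from the printed patterns above so that it runs without NLTK.

```python
import re

# Stems copied from the printed patterns above (hard-coded for this sketch).
stemmed_terms = [["Intrus", "Detect"], ["Approv", "rewrit"], ["Firewal"]]


def tight_pattern(stems):
    """Build a pattern where each stem starts a word and the words are adjacent."""
    words = [r"\b" + re.escape(stem) + r"\w*" for stem in stems]
    # only whitespace is allowed between consecutive words
    return re.compile(r"\s+".join(words), flags=re.IGNORECASE)


sentences = [
    "There are many types of intrusion detection devices in production today.",
    "An intrusion was reported before any detection system was installed.",
]

pattern = tight_pattern(stemmed_terms[0])
for sentence in sentences:
    m = pattern.search(sentence)
    print(m.group(0) if m else None)
```

This trades recall for precision: with the stricter pattern, "Approved rewrite" would no longer match "approved a rewrite of the system", so pick whichever behavior suits your descriptions.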

Edit:

If you want to know which of the terms produced the match, you can check like this:

for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))

Output:

TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls
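If the main goal is to report "Intrusion Detection" but not "Intrusion" when the longer phrase is present, one option that does not need NLTK is a single alternation with the longest terms listed first, since the regex engine tries alternatives from left to right. The following is a sketch under stated assumptions, not a definitive implementation: build_pattern and find_terms are hypothetical helper names, and plurals are handled only by allowing an optional trailing "s".

```python
import re


def build_pattern(terms):
    """Compile one alternation with longer terms first so they win over their prefixes."""
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # \b keeps matches on word boundaries; s? is a naive plural allowance
    return re.compile(r"\b(?:%s)s?\b" % alternation, flags=re.IGNORECASE)


def find_terms(terms, text):
    """Return the canonical terms found in text, in order of appearance."""
    canonical = {t.lower(): t for t in terms}
    found = []
    for m in build_pattern(terms).finditer(text):
        matched = m.group(0).lower()
        # drop the optional plural "s" when the exact form is not a known term
        if matched not in canonical and matched.endswith("s"):
            matched = matched[:-1]
        found.append(canonical[matched])
    return found


terms = ["Intrusion", "Intrusion Detection", "Firewall"]
text = "There are many types of intrusion detection devices behind firewalls."
print(find_terms(terms, text))
```

With these inputs the shorter term "Intrusion" is not reported, because the text it would match has already been consumed by "Intrusion Detection". Irregular plurals are not covered; stemming as shown earlier in this answer is the more general route for those.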

Answered 2025-04-16 by Python大师
