Find the most similar line in a file

import os
import subprocess
import getpass
import sys
import difflib
from difflib import SequenceMatcher as SM

user = getpass.getuser()
print(os.getcwd())
exeFile = (os.getcwd() + "/paths/programpaths.txt")

def get_filepaths(directory):
    file_paths = []  # List which will store all of the full filepaths.
    exes = open(os.getcwd() + "/paths/programpaths.txt", "w+")
    # Walk the tree.
    for root, directories, files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)  # Add it to the list.
            if filepath.endswith('exe') and "ninstall" not in filepath and "$RECYCLE.BIN" not in filepath:
                files = filepath.encode('cp850', errors='replace').decode('cp850')
                #print(files + "\n")
                exes.write(files + "\n")
    return file_paths  # Self-explanatory.

if not os.path.exists(exeFile):
    print("List compilation should only happen once")
    print()
    print("Compiling list of installed programs")
    print("This may take a while")
    exes = open(os.getcwd() + "/paths/programpaths.txt", "a+")
    full_file_pathsx64 = get_filepaths('C:\Program Files')
    full_file_pathsx86 = get_filepaths('C:\Program Files (x86)')
    full_file_pathsgames = get_filepaths('G:\\')
    # Run the above function and store its results in a variable.
    print("List compilation should only happen once")
    print()
    print("Done!")

pinput = input()
for line in open(exeFile):
    prog = line.split("\\")[-1]
    sim = difflib.get_close_matches(pinput, [prog], 1)
    print(sim)

3 Answers

User

#1 · Edited 2024-04-25 13:50:21

Based on the complete code you have now posted, my solution is probably the best way to solve your problem:

with open(exeFile) as f:
    programs = { path.rsplit('\\', 1)[-1].rstrip()[:-4].lower(): path.strip() for path in f }

sim = difflib.get_close_matches(pinput.lower(), programs.keys(), 1)
if sim:
    print(programs[sim[0]])

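The magic happens in the dictionary comprehension. For each path in the file, we generate the following name as the key of the dictionary entry:

path.rsplit('\\', 1)[-1].rstrip()[:-4].lower()

So take a file path like C:\Adobe\Audition CC 2014\Audition CC 2014.exe. It is first split once from the right at the backslash, and we take the last element, which gives us Audition CC 2014.exe. Next we strip the trailing whitespace. Because of the way exeFile was generated, we know that .exe is part of the file name, so we cut off the last four characters. That leaves Audition CC 2014. Finally we lowercase it, which makes it more comparable (since difflib is case sensitive).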
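The steps described above can be traced one at a time. This is only an illustrative sketch: the sample line is the example path from this answer with a trailing newline added, as it would appear when read from the file, and the intermediate variable names are made up for the illustration.

```python
# Hypothetical sample line, as it would be read from programpaths.txt
path = "C:\\Adobe\\Audition CC 2014\\Audition CC 2014.exe\n"

filename = path.rsplit("\\", 1)[-1]  # split once from the right, keep the last part
stripped = filename.rstrip()         # drop the trailing newline
name = stripped[:-4]                 # cut off the 4 characters of ".exe"
key = name.lower()                   # lowercase for case-insensitive comparison

print(key)
```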

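For the comparison, we simply get the close matches from the keys of the dictionary (i.e. the lowercased program names) and compare them against the lowercased user input.

Once we have a result, we print the path that belongs to the matched key. That is why we built the dictionary above; otherwise we would have to search the file again to find the full path.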
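To try the whole lookup without a programpaths.txt on disk, here is a minimal sketch that runs the same idea against an in-memory list. The helper names build_index and find_program and the sample paths are assumptions made for this illustration; they are not part of the original code.

```python
import difflib

def build_index(lines):
    """Map lowercased program names (without '.exe') to their full paths."""
    return {
        line.rsplit("\\", 1)[-1].rstrip()[:-4].lower(): line.strip()
        for line in lines
        if line.strip()  # skip blank lines
    }

def find_program(query, index):
    """Return the full path of the closest match, or None if nothing is close."""
    matches = difflib.get_close_matches(query.lower(), list(index), 1)
    return index[matches[0]] if matches else None

# Hypothetical stand-in for the lines of programpaths.txt
sample_lines = [
    "C:\\Adobe\\Audition CC 2014\\Audition CC 2014.exe\n",
    "C:\\Program Files\\Mozilla Firefox\\firefox.exe\n",
    "C:\\Program Files\\7-Zip\\7zFM.exe\n",
]

index = build_index(sample_lines)
print(find_program("Firefox", index))
print(find_program("audition cc", index))
print(find_program("zzzzzz", index))
```

Returning None for "no close match" keeps the caller's check as simple as the `if sim:` test above.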

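User

#2 · Edited 2024-04-25 13:50:21

The get_close_matches(…, 1) call will return either an empty list or a list containing exactly one match.

What you want to do, in English, is:

If it has an element, print it.
Otherwise, do nothing.

Translated directly into Python: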

if sim:
    print(sim[0])

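(You can write "otherwise, do nothing" as else: pass, or you can simply not write anything.)

That solves "don't print [] for every line, only print the matches".

But it leaves another problem: you don't actually get any matches.

As poke explained in the comments, the second argument of get_close_matches is a list of possibilities to check against, but the value you are passing, prog, is a string.

If it's not clear why it is a string, look at this line:

prog = line.split("\\")[-1]

You split the line into a list of smaller strings, then use [-1] to take only the last one.

If you're curious why you don't get an error: a string is itself a sequence of strings, one per character. So if prog is "abcde", you're asking it to treat ['a', 'b', 'c', 'd', 'e'] as 5 separate possibilities. That is a perfectly reasonable thing to ask; it just isn't likely to match anything.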
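A small experiment makes the difference visible. The strings here are made up for the illustration; the only thing that changes between the two calls is whether prog is wrapped in a list.

```python
import difflib

prog = "abcde"

# Iterating over a string yields its characters one by one.
print(list(prog))

# Passing the bare string: "abcd" is compared against 'a', 'b', 'c', 'd', 'e'.
as_string = difflib.get_close_matches("abcd", prog, 1)

# Passing a one-element list: "abcd" is compared against 'abcde'.
as_list = difflib.get_close_matches("abcd", [prog], 1)

print(as_string)
print(as_list)
```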

I think what you want here is probably just to pass a list containing that one possibility:

sim = difflib.get_close_matches(pinput, [prog], 1)

Alternatively, you could build one big list of all the possibilities and search them all at once, rather than one at a time:

progs = []
for line in open(exeFile):
    # Keep only the file name, without the trailing newline.
    progs.append(line.split("\\")[-1].strip())
sim = difflib.get_close_matches(pinput, progs, 1)

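But that gives you only 1 match in total for the whole file, rather than one possible match per line. If you want more than 1 in total, you can ask for that, but I'm not sure how well it works with a huge number. (You can try it and see.)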
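As a rough sketch of asking for more than one match: get_close_matches takes the maximum number of results as n and a similarity threshold as cutoff (0.6 by default). The program names below are invented for the example.

```python
import difflib

# Hypothetical list of file names collected from the paths file
progs = ["chrome.exe", "chromedriver.exe", "firefox.exe", "notepad.exe"]

# At most one result, using the default cutoff of 0.6
best = difflib.get_close_matches("chrome", progs, n=1)

# Up to three results, with a looser cutoff so weaker matches are kept too
several = difflib.get_close_matches("chrome", progs, n=3, cutoff=0.5)

print(best)
print(several)
```

The results come back sorted with the most similar name first, so several[0] is the same entry that n=1 would return.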

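Anyway, hopefully you now understand what you actually want, rather than having to guess. :)

User

#3 · Edited 2024-04-25 13:50:21

It's always good to understand what you're actually trying to do.

First define what you mean by "closest" (for strings this is commonly called the Hamming distance):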

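import re
from itertools import zip_longest  # izip_longest on Python 2

def hamming_distance(s1, s2):
    # First eliminate non-letters.
    s1 = re.sub("[^a-zA-Z]", "", s1)
    s2 = re.sub("[^a-zA-Z]", "", s2)
    # The distance is the number of positions whose letters differ;
    # zip_longest pads the shorter string, so extra letters count too.
    return sum(a != b for a, b in zip_longest(s1, s2))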

Then you just iterate over the file and find the "closest match".

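One possible way to do that search, offered only as a sketch: pick the line with the smallest distance using min(). The helper closest_line, the decision to compare against the file name without its ".exe" suffix, and the sample paths are all assumptions made for this illustration.

```python
import re
from itertools import zip_longest

def hamming_distance(s1, s2):
    # Same idea as above: ignore non-letters, count differing positions.
    s1 = re.sub("[^a-zA-Z]", "", s1)
    s2 = re.sub("[^a-zA-Z]", "", s2)
    return sum(a != b for a, b in zip_longest(s1, s2))

def closest_line(query, lines):
    """Return the (stripped) line whose program name is nearest to query."""
    def distance(line):
        # Program name: last path component, without newline and '.exe'
        name = line.rsplit("\\", 1)[-1].strip()[:-4]
        return hamming_distance(query.lower(), name.lower())
    return min(lines, key=distance).strip()

# Hypothetical stand-in for the lines of programpaths.txt
sample_lines = [
    "C:\\Program Files\\Google\\Chrome\\chrome.exe\n",
    "C:\\Program Files\\Mozilla Firefox\\firefox.exe\n",
    "C:\\Windows\\notepad.exe\n",
]

print(closest_line("firefax", sample_lines))
```

Because this distance compares letters position by position, a single missing letter at the start of the input shifts everything and inflates the distance, which is one reason difflib tends to behave better for typos.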

Once you understand that magic, you might get a marginal speedup by using difflib.
