Python 列表问题

1 投票
5 回答
1056 浏览
提问于 2025-04-15 12:27

我遇到了一些问题,希望能得到帮助。我有一个Python列表,内容如下:

fail = [
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java']
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt']
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt']
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py']
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt']
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt']
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt']

sha1 value, directory, filename

我想根据sha1值和目录把这些内容分成两个不同的列表。举个例子。

['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt']
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt']

我想把那些在同一个目录下且sha1值相同的内容添加到列表duplicate = []中(并且只限于那个目录)。其他的条目我想放到另一个列表,比如diff = [],因为它们的sha1值相同,但目录不同。

我对这个逻辑有点迷糊,所以任何帮助我都非常感激!

编辑:修正了一个拼写错误,最后一个值(文件名)在某些情况下是一个只有一个元素的列表,这完全不正确,感谢SilentGhost让我意识到这个问题。

5 个回答

1

在下面的代码示例中,我使用一个基于SHA1和目录名称的键来检测唯一和重复的条目,并为管理工作留出字典空间。

# Test dataset
fail = [
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'],
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', ['svin.txt']],
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', ['apa2.txt']],
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', ['apa.txt']],
]


def sort_duplicates(filelist):
    """Returns a tuplie whose first element is a list of unique files,
    and second element is a list of duplicate files.
    """
    diff = []
    diff_d = {}

    duplicate = []
    duplicate_d = {}

    for entry in filelist:

        # Make an immutable key based on the SHA-1 and directory strings
        key = (entry[0], entry[1])

        # If this entry is a known duplicate, add it to the duplicate list
        if key in duplicate_d:
            duplicate.append(entry)

        # If this entry is a new duplicate, add it to the duplicate list
        elif key in diff_d:
            duplicate.append(entry)
            duplicate_d[key] = entry

            # And relocate the matching entry to the duplicate list
            matching_entry = diff_d[key]
            duplicate.append(matching_entry)
            duplicate_d[key] = matching_entry
            del diff_d[key]
            diff.remove(matching_entry)

        # Otherwise add this entry to the different list
        else:
            diff.append(entry)
            diff_d[key] = entry

    return (diff, duplicate)

def test():
    global fail
    diff, dups = sort_duplicates(fail)
    print "Diff:", diff
    print "Dups:", dups

test()
1

你可以简单地遍历所有的值,然后用一个内部循环来比较目录。如果目录相同,再比较值,最后把它们放到列表里。这样做的话,你就能得到一个比较简单的n^2算法来整理这些数据。

可能像这样的一段未经测试的代码:

>>>for i in range(len(fail)-1):
...   dir = fail[i][1]
...   sha1 = fail[i][0]
...   for j in range(i+1,len(fail)):
...      if dir == fail[j][1]: #is this how you compare strings?
...         if sha1 == fail[j][0]:
...            #remove from fail and add to duplicate and add other to diff

再说一次,这段代码是未经测试的。

3
duplicate = []
# Sort the list so we can compare adjacent values
fail.sort()
#if you didn't want to modify the list in place you can use:
#sortedFail = sorted(fail)
#      and then use sortedFail in the rest of the code instead of fail
for i, x in enumerate(fail):
    if i+1 == len(fail):
        #end of the list
        break
    if x[:2] == fail[i+1][:2]:
        if x not in duplicate:
            duplicate.add(x)
        if fail[i+1] not in duplicate:
            duplicate.add(fail[i+1])
# diff is just anything not in duplicate as far as I can tell from the explanation
diff = [d for d in fail if d not in duplicate]

根据你给的例子输入

duplicate: [
              ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', ['apa.txt']], 
              ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt']
           ]

diff: [
          ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', ['apa2.txt']], 
          ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'], 
          ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', ['svin.txt']],
          ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
          ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py']
      ]

我可能有些理解错了,但我觉得这就是你想问的内容。

撰写回答