Matching strings across multiple datasets in Python

Asked 2025-04-17 08:12

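I'm using Python and want to match strings across several data files. First I unpickle each file with pickle and put the contents into a list. I only want to match strings that belong to the same conditions, which are identified at the end of each string.

My current script looks roughly like this: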

import pickle

# Pickle files should be opened in binary mode.
f = open("data_a.dat", "rb")
list_a = pickle.load(f)
f.close()

f = open("data_b.dat", "rb")
list_b = pickle.load(f)
f.close()

f = open("data_c.dat", "rb")
list_c = pickle.load(f)
f.close()

f = open("data_d.dat", "rb")
list_d = pickle.load(f)
f.close()


for a in list_a:
    for b in list_b:
        for c in list_c:
            for d in list_d:
                if a.GetName()[12:] in b.GetName():
                    if a.GetName()[12:] in c.GetName():
                        if a.GetName()[12:] in d.GetName():
                            "do whatever"

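With two lists this approach seems to work fine. The trouble starts when I try to add another 8 or 9 data files that have to be matched on the same conditions: the script simply cannot cope and hangs. I'd appreciate any help.

Edit: Each list contains histograms, named after the parameters used to create them. The histogram names contain those parameters and their values, usually at the end of the string. In the example I did this for two datasets; now I want to do it for 9 datasets without writing one loop per dataset.

Edit 2: I have just extended the code to reflect more accurately what I want to do. If I try this with 9 lists, it not only looks horrible, it also doesn't work.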

3 Answers

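Here is a small piece of code that may give you some ideas. The main idea is to replace the nested loops with a single function that advances a list of positions, one per input list.

For simplicity I assume the data is already loaded into lists, but you can read it from the files first: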

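import pickle

data_files = [
    'data_a.dat',
    'data_b.dat',
    'data_c.dat',
    'data_d.dat',
    'data_e.dat',
]

def load_pickle(path):
    # Open in binary mode and close the file as soon as it has been read.
    with open(path, 'rb') as f:
        return pickle.load(f)

lists = [load_pickle(f) for f in data_files]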

Since I don't know the details of what you really need to do, my goal here is simply to find matches on the first four characters:

def do_whatever(string):
    print("I have matched the string '%s'" % string)

lists = [
    ["hello", "world", "how", "grown", "you", "today", "?"],
    ["growl", "is", "a", "now", "on", "appstore", "too bad"],
    ["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]

# One index per list; together they identify one combination of strings.
positions = [0 for i in range(len(lists))]

def recursive_match(positions, lists):
    # Pick the current string from each list.
    strings = [l[p] for p, l in zip(positions, lists)]
    match = True
    # The first four characters of the first string are the search key.
    searched_string = strings.pop(0)[:4]
    for string in strings:
        if searched_string not in string:
            match = False
            break
    if match:
        do_whatever(searched_string)

    # Increment positions, last list first, carrying over when a list wraps.
    new_positions = positions[:]
    lists_len = len(lists)
    for i, l in enumerate(reversed(lists)):
        max_position = len(l) - 1
        list_index = lists_len - i - 1
        current_position = positions[list_index]
        if max_position > current_position:
            new_positions[list_index] += 1
            break
        else:
            new_positions[list_index] = 0

    # All positions back at zero means every combination has been visited.
    return new_positions, not any(new_positions)


search_is_finished = False

while not search_is_finished:
    positions, search_is_finished = recursive_match(positions, lists)

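Of course there is a lot you could optimize here, and this is only draft code, but have a look at the recursive_match function; the idea behind it is an important one.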
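
The position-stepping loop above visits every combination of one string per list, which is what itertools.product from the standard library yields directly. A minimal sketch, reusing the sample lists and the first-four-characters rule from the code above; the helper name find_prefix_matches is made up for illustration:

```python
from itertools import product

lists = [
    ["hello", "world", "how", "grown", "you", "today", "?"],
    ["growl", "is", "a", "now", "on", "appstore", "too bad"],
    ["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]

def find_prefix_matches(lists, prefix_len=4):
    """Collect prefixes of first-list strings found in one string from every other list."""
    matches = []
    # product(*lists) yields one tuple per combination, last list varying fastest.
    for first, *others in product(*lists):
        prefix = first[:prefix_len]
        if all(prefix in other for other in others):
            matches.append(prefix)
    return matches

print(find_prefix_matches(lists))
```

This removes the bookkeeping but not the cost: the number of combinations is still the product of the list lengths, so 9 lists of 300 entries would mean 300**9 (roughly 2 × 10^22) checks.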

Answered 2025-04-17 by Python大师

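In the end I decided to use the built-in map function. I realize now that I should have explained what I meant more clearly (I will do so in future).

My data files contain histograms that depend on 5 parameters; some depend on only 3 or 4. Roughly like this: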

par1=["list with some values"]
par2=["list with some values"]
par3=["list with some values"]
par4=["list with some values"]
par5=["list with some values"]

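I need to check how the plotted quantity behaves for every combination of parameter values. In the end I get a data file with about 300 histograms, each with a name that contains the corresponding parameter values and the sample name. It looks roughly like this: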

datasample1-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample1-"permutation of the above values"
...
datasample9-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample9-"permutation of the above values"

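So I end up with 300 histograms for each of the 9 data files, but luckily the histograms are all created in the same order. That means I can simply use map to pair them up: I unpickle the data files, put each one into a list, and then use map to pair every histogram with the histograms of the same configuration in the other data samples.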

# Python 2 only: map(None, ...) groups the i-th items of every list into one tuple.
for lst in map(None, data1_histosli, data2_histosli, ...data9_histosli):
    do_something(lst)

That solved my problem. Thanks for the help, everyone!
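
Note that map(None, ...) only works in Python 2. In Python 3 the same pairing can be done with zip, or with itertools.zip_longest if the lists might differ in length (it pads with None, as map(None, ...) did). A small sketch with stand-in data; the list names and contents below are placeholders, not the real histograms:

```python
from itertools import zip_longest

# Placeholder name lists standing in for the histogram lists, all in the same order.
data1_histos = ["datasample1-par1=1-par2=a", "datasample1-par1=2-par2=b"]
data2_histos = ["datasample2-par1=1-par2=a", "datasample2-par1=2-par2=b"]
data3_histos = ["datasample3-par1=1-par2=a"]  # deliberately one entry short

all_samples = [data1_histos, data2_histos, data3_histos]

# zip_longest pads shorter lists with None, like Python 2's map(None, ...).
groups = list(zip_longest(*all_samples))

for group in groups:
    print(group)
```

Using zip(*all_samples) instead would stop at the shortest list, which may be preferable when a missing histogram should not produce a partial group.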

Answered 2025-04-17 by Python大师

Off the top of my head:

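import pickle

files = ["file_a", "file_b", "file_c"]
sets = []

for name in files:
    # Open each file in turn, in binary mode, and close it after loading.
    with open(name, "rb") as f:
        sets.append(set(pickle.load(f)))

# Elements that appear in every set.
intersection = sets[0].intersection(*sets[1:])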

Edit: I had overlooked your mapping to x.GetName()[12:], but you should still be able to reduce your problem to set logic.
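
One way to combine the set idea with the x.GetName()[12:] mapping is to index each list by that name suffix and intersect the key sets. A rough sketch, assuming the text after the first 12 characters identifies the configuration exactly and is unique within each list; the Histo class is a hypothetical stand-in for the real histogram objects, and match_by_suffix is an illustrative name:

```python
class Histo:
    """Hypothetical stand-in for a histogram object with a GetName() method."""
    def __init__(self, name):
        self._name = name

    def GetName(self):
        return self._name

def match_by_suffix(lists, skip=12):
    """Group objects from all lists that share the same name suffix."""
    # One dict per list: suffix -> object (assumes suffixes are unique per list).
    indexes = [{h.GetName()[skip:]: h for h in lst} for lst in lists]
    # Suffixes present in every list.
    common = set(indexes[0]).intersection(*indexes[1:])
    return {key: [index[key] for index in indexes] for key in sorted(common)}

list_a = [Histo("datasample1-par1=1"), Histo("datasample1-par1=2")]
list_b = [Histo("datasample2-par1=2"), Histo("datasample2-par1=3")]

for suffix, group in match_by_suffix([list_a, list_b]).items():
    print(suffix, [h.GetName() for h in group])
```

Each list is scanned once to build its index, so the work grows with the total number of histograms rather than with the product of the list lengths. Unlike the substring test in the question, this compares suffixes for equality.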

Answered 2025-04-17 by Python大师
