Comparing tuples element-wise to return a set of sets - python


Asked 2025-04-19 12:18

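I'm new to Python. Can someone help me with this requirement? I have a dataset whose first row holds the attribute names and whose remaining rows are the records.

I need to compare every record with every other record, find the elements that differ, and report the attribute names of those elements. In the end, I want a set of sets as output.

For example, with 3 records and 3 columns like this: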

         Col1 Col2 Col3
tuple1    H   C    G
tuple2    H   M    G
tuple3    L   M    S

Then the output should be: tuple1,tuple2 = {Col2}, tuple1,tuple3 = {Col1,Col2,Col3}, tuple2,tuple3 = {Col1,Col3}

The final output should be {{Col2},{Col1,Col2,Col3},{Col1,Col3}}

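Here is what I have tried.

I read each row into a list. All attributes go into one list (called list_attr), and the records form a list of lists (called rows). For each record, I loop over all the other records, find the elements that differ, and use their indices to look up the attribute names. Finally, I convert those attribute names into sets. My code is below. The problem is that I have 50,000 records and 15 attributes to process, and this loop runs very slowly. Is there another way to do this faster, or to improve the performance?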

dis_sets = []  
for l in rows:
    for l1 in rows:
        if l != l1:
            i = 0
            in_sets = []
            while(i < length):
                if l[i] != l1[i]:
                    in_sets.append(list_attr[i])
                i = i+1
            if in_sets != []:
                dis_sets.append(in_sets)
skt = set(frozenset(temp) for temp in dis_sets)


1 Answer

Consider:

>>> tuple1=('H', 'C', 'G')
>>> tuple2=('H', 'M', 'G')
>>> tuple3=('L', 'M', 'S')

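OK, you state: "I need to compare every record with every other record and report the attribute names of the elements that differ."

Put that into code: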

>>> [i for i, t in enumerate(zip(tuple1, tuple2), 1) if t[0]!=t[1]]
[2]
>>> [i for i, t in enumerate(zip(tuple1, tuple3), 1) if t[0]!=t[1]]
[1, 2, 3]
>>> [i for i, t in enumerate(zip(tuple2, tuple3), 1) if t[0]!=t[1]]
[1, 3]

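Then you state: "The final output should be {{Col2},{Col1,Col2,Col3},{Col1,Col3}}."

A set of sets has no order, so you could not tell which inner set came from which pair, and that output does not make sense. It should be: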

>>> [[i for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair in 
...     [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]]
[[2], [1, 2, 3], [1, 3]]

If you really want sets, you can have them as the sub-elements; with a true set of sets, you lose the information about which pair produced which set.
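To make that point concrete, here is a minimal sketch of what a true set of sets looks like. It is written for Python 3, and the names `diff_positions` and `pairs_to_compare` are made up for this illustration. The inner sets have to be `frozenset` objects, because a plain `set` is unhashable and cannot be stored inside another set.

```python
tuple1 = ('H', 'C', 'G')
tuple2 = ('H', 'M', 'G')
tuple3 = ('L', 'M', 'S')

# Hypothetical name: the pairs we want to compare, in a known order.
pairs_to_compare = [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]


def diff_positions(a, b):
    """Return the 1-based positions where two equal-length tuples differ."""
    return frozenset(i for i, (x, y) in enumerate(zip(a, b), 1) if x != y)


# The outer container is a real set, so it has no order and no duplicates.
set_of_sets = {diff_positions(a, b) for a, b in pairs_to_compare}

# Membership tests work, but nothing records which pair produced which set.
print(frozenset({1, 3}) in set_of_sets)
```

If two different pairs happen to differ in the same columns, they collapse into a single entry here, which is another way the pair information gets lost.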

A list of sets:

>>> [{i for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]} for pair in 
...     [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]]
[set([2]), set([1, 2, 3]), set([1, 3])]

And this is almost the output you want:

>>> [{'Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]} for pair in 
...     [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]]
[set(['Col2']), set(['Col2', 'Col3', 'Col1']), set(['Col3', 'Col1'])]

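(Note that because sets are unordered, the order of the strings changes. What would you get if the order of the outermost level changed as well?)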

Note that with a list of lists, you get closer to the output you want:

>>> [['Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair 
...     in [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]]
[['Col2'], ['Col1', 'Col2', 'Col3'], ['Col1', 'Col3']]

Edit based on comments

You can do something like this:

def pairs(LoT):
    # for production code, consider using a deque of tuples...
    LoT = list(LoT)  # work on a copy so the caller's list is not emptied
    seen = set()     # hold the pair combinations seen
    while LoT:
        f = LoT.pop(0)
        for e in LoT:
            se = frozenset([f, e])
            if se not in seen:
                seen.add(se)
                yield se

>>> list(pairs([('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')]))
[frozenset([('H', 'M', 'G'), ('H', 'C', 'G')]), frozenset([('L', 'M', 'S'), ('H', 'C', 'G')]), frozenset([('H', 'M', 'G'), ('L', 'M', 'S')])]

Then use it like this:

>>> LoT=[('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')]
>>> [['Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair 
...        in pairs(LoT)]
[['Col2'], ['Col1', 'Col2', 'Col3'], ['Col1', 'Col3']]
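As an alternative to a hand-written pair generator, the standard library's `itertools.combinations` yields every unordered pair exactly once and leaves the input list untouched. The sketch below is written for Python 3 and is not the approach used above; the function name `differing_columns` is made up for illustration.

```python
from itertools import combinations


def differing_columns(rows):
    """For each unordered pair of rows, list the 'ColN' names where they differ."""
    result = []
    for a, b in combinations(rows, 2):
        # enumerate(..., 1) gives 1-based column numbers, as in the examples above.
        cols = ['Col{}'.format(i)
                for i, (x, y) in enumerate(zip(a, b), 1) if x != y]
        result.append(cols)
    return result


LoT = [('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')]
print(differing_columns(LoT))
# The input list is still intact and can be reused afterwards.
print(len(LoT))
```

Compared with the nested loop in the question, which visits each pair twice (once in each direction), this visits each pair once. Unlike a `seen` set of frozensets, it does not skip repeated pairs when the input contains duplicate rows.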

Edit #2

If you want names from a header instead of computed names:

>>> theader=['tuple col 1', 'col 2', 'the third' ]
>>> [[theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]] for pair
...       in pairs(LoT)]
[['col 2'], ['tuple col 1', 'col 2', 'the third'], ['tuple col 1', 'the third']]

If you want (and I suspect this is the right answer) a list of dicts whose values are lists:

>>> di=[]
>>> for pair in pairs(LoT):    
...    di.append({repr(list(pair)): [theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]]})
>>> di
[{"[('H', 'M', 'G'), ('H', 'C', 'G')]": ['col 2']}, {"[('L', 'M', 'S'), ('H', 'C', 'G')]": ['tuple col 1', 'col 2', 'the third']}, {"[('H', 'M', 'G'), ('L', 'M', 'S')]": ['tuple col 1', 'the third']}]

Or, simply a single dict of lists:

>>> di={}
>>> for pair in pairs(LoT):    
...    di[repr(list(pair))]=[theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]]  
>>> di
{"[('H', 'M', 'G'), ('L', 'M', 'S')]": ['tuple col 1', 'the third'], "[('L', 'M', 'S'), ('H', 'C', 'G')]": ['tuple col 1', 'col 2', 'the third'], "[('H', 'M', 'G'), ('H', 'C', 'G')]": ['col 2']}
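On the performance side of the question: the number of pairs grows quadratically, so 50,000 records mean 50,000 × 49,999 / 2 = 1,249,975,000 comparisons, and no comprehension makes that cheap. If only the final set of sets matters, one thing that may help is dropping duplicate records first, since identical records always produce the same difference sets. The sketch below is written for Python 3; the function name `distinct_difference_sets` is made up for illustration, and how much it saves depends entirely on how many duplicate rows the data contains.

```python
from itertools import combinations


def distinct_difference_sets(rows, header):
    """Return the set of frozensets of column names in which any two rows differ."""
    # dict.fromkeys drops duplicate rows while keeping first-seen order.
    unique_rows = list(dict.fromkeys(tuple(r) for r in rows))
    found = set()
    for a, b in combinations(unique_rows, 2):
        # Unique rows always differ somewhere, so no empty set is added.
        found.add(frozenset(name for name, x, y in zip(header, a, b) if x != y))
    return found


header = ['Col1', 'Col2', 'Col3']
rows = [['H', 'C', 'G'], ['H', 'M', 'G'], ['L', 'M', 'S'], ['H', 'C', 'G']]
result = distinct_difference_sets(rows, header)
print(len(result))
```

With the four rows above, the duplicate of the first record is removed, so only three pairs are compared instead of six.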

Answered 2025-04-19 by Python大师
