在Python中如何找出矩阵中所有可能的交点？

2 投票

4 回答

953 浏览

提问于 2025-04-15 23:52

假设我有一个矩阵（可以理解为一个列表的列表），其中每一列代表一个独特的单词，每一行代表一个不同的文档，而每个单元格的值是1或0，表示某个文档中是否包含对应列的单词。

我想知道的是，如何找出所有可能的单词和文档的组合，其中有多个单词与多个文档是共同存在的。结果可能看起来像这样：

[ [[Docid_3, Docid_5], ['word1', 'word17', 'word23']], 
  [[Docid_3, Docid_9, Docid_334], ['word2', 'word7', 'word23', 'word68', 'word982']],
  ...

接下来就是每一种可能的组合。我希望能找到一个解决方案，既能提供所有组合的完整集合，又能给出那些不是其他组合子集的组合。比如，[[Docid_3, Docid_5], ['word1', 'word17']]就不算，因为它是第一个例子的一个完整子集。

我觉得应该有一个优雅的解决方案，但就是想不起来，喝啤酒也没什么帮助。

谢谢。

数据结构逻辑运算矩阵组合数学交点文档分析单词组合子集

4 个回答

看看这个链接：关于推荐相关文章的可靠算法

对于实际问题，比如说超过100个文档和10000个单词，建议使用很不错的bitarray模块。
（顺便提一下，这个模块说“在Python中使用相同的算法大约比在C中慢20倍。”）

关于“只有那些不是另一个子集的组合”：
我们可以定义一个hit22为一个2x2的子矩阵，里面有11和11，
而hit23则是一个2x3的子矩阵，里面有111和111（也就是2个文档，3个共同的单词），依此类推。
一个特定的hit22可能会出现在很多hit2n中——2个文档，n个单词，
也可能出现在很多hitn2中——n个文档，2个单词。听起来挺有意思的。

补充：6月14日星期一，使用bitarray的小函数。
（关于真实文档分类的Python模块？不太清楚。）

""" docs-words with bitarray, randombits """
# google "document classification" (tutorial | python) ...
# https://stackoverflow.com/questions/1254627/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there

from __future__ import division
import random
import sys
from bitarray import bitarray  # http://pypi.python.org/pypi/bitarray

__date__ = "14jun 2010 denis"

ndoc = 100
nbits = 1000

exec "\n".join( sys.argv[1:] )  # run this.py ndoc= ...
random.seed(1)
me = __file__.split('/') [-1]
print "%s  ndoc=%d  nbits=%d" % (me, ndoc, nbits)

    # bitarray stuff --

def bitslist( bits ):
    """ 011001 -> [1,2,5] """
    return [ j for j in range(len(bits)) if bits[j] ]

hex_01 = {
    "0": "0000", "1": "0001", "2": "0010", "3": "0011",
    "4": "0100", "5": "0101", "6": "0110", "7": "0111",
    "8": "1000", "9": "1001", "a": "1010", "b": "1011",
    "c": "1100", "d": "1101", "e": "1110", "f": "1111",
}

def to01( x, len_ ):
    x = "%x" % x
    s = "".join( hex_01[c] for c in x )
    return (len_ - len(s)) * "0"  + s

def randombits( nbits ):
    """ -> bitarray 1/16 1, 15/16 0 """
    hibit = 1 << (nbits - 1)
    r = (random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 ))  # prob 1/16
    return bitarray( to01( r, nbits ))

#...............................................................................
doc = [ randombits(nbits) for j in range(ndoc) ]  # ndoc x nbits

def mostsimilarpair():
    """ -> (sim, j, k) most similar pair of docs """
    mostsim = (-1,-1,-1)
    for j in range(ndoc):
        for k in range(j+1, ndoc):
                # allpairs[j,k] -> scipy.cluster.hier ?
            sim = (doc[j] & doc[k]).count()  # nr bits (words) in common, crude
            mostsim = max( mostsim, (sim,j,k) )
    return mostsim

sim, jdoc, kdoc = mostsimilarpair()
print "The 2 most similar docs:" ,
print "doc %d has %d words," % ( jdoc, doc[jdoc].count() ) ,
print "doc %d has %d," % ( kdoc, doc[kdoc].count() )
print "%d words in common: %s" % ( sim, bitslist( doc[jdoc] & doc[kdoc] ))
print ""

#...............................................................................
def docslike( jdoc, thresh ):
    """ -> (doc index, sim >= thresh) ... """
    for j in range(ndoc):
        if j == jdoc:  continue
        sim = (doc[j] & doc[jdoc]).count()
        if sim >= thresh:
            yield (j, sim)

thresh = sim // 2
print "Docs like %d, with >= %d words in common:" % (jdoc, thresh)
for j, sim in docslike( jdoc, thresh ):
    print "%3d  %s" % ( j, bitslist( doc[j] & doc[jdoc] ))

"""
The 2 most similar docs: doc 72 has 66 words, doc 84 has 60,
12 words in common: [11, 51, 119, 154, 162, 438, 592, 696, 800, 858, 860, 872]

Docs like 72, with >= 6 words in common:
  2  [3, 171, 258, 309, 592, 962]
...
"""

回答于 2025-04-15 由 Python大师

分享举报

首先，你需要建立一个文档ID和单词集合之间的映射关系——因为你现在的这个由0和1组成的矩阵直接处理起来非常麻烦。如果我理解得没错，"列标题"（单词）是矩阵中的第一个列表（去掉第一个项目），而"行标题"（文档ID）是每一行的第一个项目（去掉第一行）。接下来（假设你使用的是Python 2.6或更高版本）：

def makemap(matrix):
  im = iter(matrix)
  words = next(im)[1:]
  themap = {}
  for row in im:
    mapent = set()
    docid = row[0]
    for w, bit in zip(words, row[1:]):
      try:
        if bit: mapent.add(w)
      except:
        print 'w is %r' % (w,)
        raise
    themap[docid] = mapent
  return themap

现在你需要检查所有可行的文档子集——总的子集数量非常庞大，所以你真的希望尽可能减少搜索的范围，而简单地生成所有子集（比如通过itertools.combinations循环不同长度）显然不会进行任何的剪枝。

我会从所有的2组合开始（所有文档ID的对——当然可以用itertools.combinations来实现），然后生成第一批“可行的2长度子集”（那些有2个以上共同单词的对）。这些可以放在另一个映射中，使用tuple或frozenset的文档ID作为键。

接着，为了生成可行的(N+1)长度子集，我只会尝试通过添加一个新的文档ID来扩展现有的可行N长度子集（当然要检查总的交集是否仍然有2个以上的共同单词）。这样至少可以进行一些剪枝，而不是盲目尝试所有的2**N子集（甚至只是长度至少为2的2**N - N - 1子集）。

也许可以通过记录所有无法扩展某个N长度子集的文档ID来做得更好——因为没有必要再尝试那些文档ID与从该子集派生出的任何(N+1)长度子集。这是值得尝试的第二层剪枝/优化。

你可能还可以做进一步的调整来进行更多的剪枝，但我现在想不出具体的，所以我会从这里开始。（为了可读性，我下面不再使用微优化，比如用iteritems代替items，用frozenset代替tuple等等——考虑到这些序列都是O(N)，而计算结构的大小是指数级的，这些微优化可能并不重要，当然在调优/优化阶段尝试一下也是值得的）。

def allfeasiblesubsets(matrix):
  mapping = makemap(matrix)
  docids = sorted(mapping.keys())
  feasible_len2 = {}
  dont_bother = dict((d, set([d])) for d in docids)
  for d1, d2 in itertools.combinations(docids, 2):
    commonw = mapping[d1].intersection(mapping[d2])
    if len(commonw) >= 2:
      feasible_len2[d1, d2] = commonw
    else:
      dont_bother[d1].add(d2)
      dont_bother[d2].add(d1)
  all_feasible = [feasible_len2]
  while all_feasible[-1]:
    feasible_Np1 = {}
    for ds, ws in all_feasible[-1].items():  
      md = max(ds)        
      for d, w in mapping.items():
        if d <= md or any(d in dont_bother[d1] for d1 in ds):
          continue
        commonw = w.intersection(ws)
        if len(commonw) >= 2:
          feasible_Np1[ds+(d,)] = commonw
    all_feasible.append(feasible_Np1)
  return all_feasible[:-1]

你会注意到我只应用了我建议的“进一步剪枝”的一种温和形式——dont_bother只记录了一个文档ID与其他文档ID之间的“兼容性”（共同单词少于2个）——如果有多个这样的“兼容性文档ID”对，这可能会有所帮助，而且简单且相对不干扰，但在剪枝方面不如更严格的“完全”替代方案强大。我还试图将feasible*字典中的所有键保持为文档ID的排序元组（就像itertools.combinations最初提供的对一样），以避免重复，从而减少冗余工作。

这是我尝试的小例子（当然是在这些函数之后的同一个文件中，前面有import语句引入itertools和collections）：

mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
        ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
        ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]

mm = makemap(mat)
print mm
afs = allfeasiblesubsets(mat)
print afs

结果看起来还不错，具体是：

{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])}, {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]

不过当然可能还有潜在的错误，因为我还没有彻底测试过。顺便说一下，我希望这里提供的结果（一个包含不同长度的字典列表，每个字典的键是有序的文档ID元组形式，值是它们的共同单词集合）可以很容易地后处理成你可能更喜欢的任何其他形式，比如嵌套列表。

（虽然这无关紧要，但我在例子中使用的文本是一个古老的意大利谚语；-）。

回答于 2025-04-15 由 Python大师

分享举报

先把文本规范化。你只需要由小写字母组成的字符串，其他的都要去掉。
把这些字符串转成set集合。

可以用类似下面的代码来获取所有可能的组合，组合的大小可以不同：

def get_all_lengths_combinations_of(elements):
  for no_of_items in range(2, len(elements)+1):
    for items in itertools.combinations(elements, no_of_items)
        yield items

我相信真正的itertools高手会想出更好的方法，可能会用到izip()。

记得你可以这样使用set.intersection()方法：
```
set.intersection(*list_of_sets_to_intersect)
```

回答于 2025-04-15 由 Python大师

分享举报

在Python中如何找出矩阵中所有可能的交点？

4 个回答

撰写回答