如何为给定外群的一组物种生成所有可能的Newick树排列？

2条回答

网友

1楼 · 编辑于 2024-05-14 06:46:58

作为叶子集合的递归集合的树

让我们暂时把newick表示放在一边，考虑一下这个问题的一个可能的python表示。在

根树可以看作是（set of（sets of（sets of…））叶的递归层次结构。集合是无序的，这非常适合于描述树中的类：{{{"A", "B"}, {"C", "D"}}, "E"}应该与{}相同。在

如果我们考虑叶的初始集{"A", "B", "C", "D", "E"}，那么以“E”为外群的树是形式为{tree, "E"}的集合，其中tree来自可以从叶子集{}构建的树集。我们可以尝试编写一个递归的trees函数来生成这组树，我们的全部树集将表示如下：

{{tree, "E"} for tree in trees({"A", "B", "C", "D"})}

（这里，我使用set comprehension表示法。）

实际上，python不允许集合集合，因为集合的元素必须是“散列的”（也就是说，python必须能够计算对象的一些“散列”值，以便能够检查它们是否属于集合）。python集合没有这个属性。幸运的是，我们可以使用一个类似的名为^{}的数据结构，它的行为非常类似于一个集合，但是不能被修改并且是“散列”的。因此，我们的一组树将是：

^{pr2}$
实现trees函数
现在让我们关注trees函数。在
对于叶集的每个可能的分区（分解为一组不相交的子集，包括所有元素），我们需要为分区的每个部分找到所有可能的树（通过递归调用）。对于一个给定的分区，我们将为每个可能的子树组合生成一个树。在
例如，如果一个分区是{"A", {"B", "C", "D"}}，我们将考虑所有可以由部分"A"生成的树（实际上，只是叶"A"本身），以及可以由部分{"B", "C", "D"}（即trees({"B", "C", "D"})）生成的所有可能的树。然后，通过获取所有可能的对，其中一个元素来自"A"，另一个元素来自trees({"B", "C", "D"})，就可以得到这个分区的可能树。在
这可以推广到包含两个以上部分的分区，并且itertools中的product函数在这里似乎很有用。在
因此，我们需要一种方法来生成一组叶子的可能分区。在
生成集合的划分
在这里，我根据this solution创建了一个partitions_of_set函数：
# According to https://stackoverflow.com/a/30134039/1878788: # The problem is solved recursively: # If you already have a partition of n-1 elements, how do you use it to partition n elements? # Either place the n'th element in one of the existing subsets, or add it as a new, singleton subset. def partitions_of_set(s): if len(s) == 1: yield frozenset(s) return # Extract one element from the set # https://stackoverflow.com/a/43804050/1878788 elem, *_ = s rest = frozenset(s - {elem}) for partition in partitions_of_set(rest): for subset in partition: # Insert the element in the subset try: augmented_subset = frozenset(subset | frozenset({elem})) except TypeError: # subset is actually an atomic element augmented_subset = frozenset({subset} | frozenset({elem})) yield frozenset({augmented_subset}) | (partition - {subset}) # Case with the element in its own extra subset yield frozenset({elem}) | partition
为了检查获得的分区，我们创建了一个函数，使它们更易于显示（这对于以后生成树的newick表示也很有用）：
def print_set(f): if type(f) not in (set, frozenset): return str(f) return "(" + ",".join(sorted(map(print_set, f))) + ")"
我们测试分区是否有效：
for partition in partitions_of_set({"A", "B", "C", "D"}): print(len(partition), print_set(partition))
输出：
1 ((A,B,C,D)) 2 ((A,B,D),C) 2 ((A,C),(B,D)) 2 ((B,C,D),A) 3 ((B,D),A,C) 2 ((A,B,C),D) 2 ((A,B),(C,D)) 3 ((A,B),C,D) 2 ((A,D),(B,C)) 2 ((A,C,D),B) 3 ((A,D),B,C) 3 ((A,C),B,D) 3 ((B,C),A,D) 3 ((C,D),A,B) 4 (A,B,C,D)
trees函数
的实际代码
现在我们可以编写tree函数：
from itertools import product def trees(leaves): if type(leaves) not in (set, frozenset): # It actually is a single leaf yield leaves # Don't try to yield any more trees return # Otherwise, we will have to consider all the possible # partitions of the set of leaves, and for each partition, # construct the possible trees for each part for partition in partitions_of_set(leaves): # We need to skip the case where the partition # has only one subset (the initial set itself), # otherwise we will try to build an infinite # succession of nodes with just one subtree if len(partition) == 1: part, *_ = partition # Just to be sure the assumption is correct assert part == leaves continue # We recursively apply *tree* to each part # and obtain the possible trees by making # the product of the sets of possible subtrees. for subtree in product(*map(trees, partition)): # Using a frozenset guarantees # that there will be no duplicates yield frozenset(subtree)
测试：
all_trees = frozenset( {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})}) for tree in all_trees: print(print_set(tree) + ";")
输出：
(((B,C),A,D),E); ((((A,B),D),C),E); ((((B,D),A),C),E); ((((C,D),A),B),E); (((A,D),B,C),E); ((A,B,C,D),E); ((((B,D),C),A),E); (((A,B,C),D),E); ((((A,C),B),D),E); ((((C,D),B),A),E); ((((B,C),A),D),E); (((A,B),C,D),E); (((A,C),(B,D)),E); (((B,D),A,C),E); (((C,D),A,B),E); ((((A,B),C),D),E); ((((A,C),D),B),E); (((A,C,D),B),E); (((A,D),(B,C)),E); ((((A,D),C),B),E); ((((B,C),D),A),E); (((A,B),(C,D)),E); (((A,B,D),C),E); ((((A,D),B),C),E); (((A,C),B,D),E); (((B,C,D),A),E);
我希望结果是正确的。在
这种方法有点棘手。我花了一些时间来找出如何避免无限递归（当分区是{{"A", "B", "C", "D"}}时就会发生这种情况）。在

网友
2楼 · 编辑于 2024-05-14 06:46:58

这是个很难回答的问题！这是我的旅程。在
第一个观察是outgroup始终是一个连接到newick字符串末尾的节点。让我们把其余的物种称为内群，并尝试生成这些物种的所有排列。然后简单地添加outgroup。在
from itertools import permutations def ingroup_generator(species, n): for perm in permutations(species, n): yield tuple([tuple(perm), tuple(s for s in species if s not in perm)]) def format_newick(s, outgroup=''): return '(' + ', '.join('({})'.format(', '.join(p)) for p in s) + ',({}));'.format(outgroup) species = ["A","B","C","D","E"] outgroup = "E" ingroup = [s for s in species if s != outgroup] itertools_newicks= [] for n in range(1, len(ingroup)): for p in ingroup_generator(ingroup, n): itertools_newicks.append(format_newick(p, outgroup)) for newick in itertools_newicks: print newick
这将返回40个newick字符串：
^{pr2}$
其中有些是重复的，但我们稍后将删除这些重复项。在
作为bli noted in the comments，(((("A","B"),"C"),"D"),("E"));及其变体也应被视为有效解。 comments on BioStar向我指出了正确的方向，这与生成二叉树的所有可能的分组是一样的。我找到了一个漂亮的Python implementation in this StackOverflow answer by rici：
# A very simple representation for Nodes. Leaves are anything which is not a Node. class Node(object): def __init__(self, left, right): self.left = left self.right = right def __repr__(self): return '(%s, %s)' % (self.left, self.right) # Given a tree and a label, yields every possible augmentation of the tree by # adding a new node with the label as a child "above" some existing Node or Leaf. def add_leaf(tree, label): yield Node(label, tree) if isinstance(tree, Node): for left in add_leaf(tree.left, label): yield Node(left, tree.right) for right in add_leaf(tree.right, label): yield Node(tree.left, right) # Given a list of labels, yield each rooted, unordered full binary tree with # the specified labels. def enum_unordered(labels): if len(labels) == 1: yield labels[0] else: for tree in enum_unordered(labels[1:]): for new_tree in add_leaf(tree, labels[0]): yield new_tree
然后
enum_newicks= [] for t in enum_unordered(ingroup): enum_newicks.append('({},({}));'.format(t, outgroup)) for newick in enum_newicks: print newick
生成以下15个新图标：
((A, (B, (C, D))),(E)); (((A, B), (C, D)),(E)); ((B, (A, (C, D))),(E)); ((B, ((A, C), D)),(E)); ((B, (C, (A, D))),(E)); ((A, ((B, C), D)),(E)); (((A, (B, C)), D),(E)); ((((A, B), C), D),(E)); (((B, (A, C)), D),(E)); (((B, C), (A, D)),(E)); ((A, (C, (B, D))),(E)); (((A, C), (B, D)),(E)); ((C, (A, (B, D))),(E)); ((C, ((A, B), D)),(E)); ((C, (B, (A, D))),(E));
所以现在我们已经有了40+15=55个可能的newick字符串，我们必须删除重复项。在
我尝试的第一个死胡同是为每个newick字符串创建一个规范化的表示，这样我就可以将它们用作字典中的键。其想法是递归地对所有节点中的字符串进行排序。但首先我必须捕获所有（嵌套）节点。我不能使用正则表达式，因为nested structures are by definition not regular。在
因此，我使用了pyparsing包，并提出了这个：
from pyparsing import nestedExpr def sort_newick(t): if isinstance(t, str): return sorted(t) else: if all(isinstance(c, str) for c in t): return sorted(t) if all(isinstance(l, list) for l in t): return [sort_newick(l) for l in sorted(t, key=lambda k: sorted(k))] else: return [sort_newick(l) for l in t] def canonical_newick(n): n = n.replace(',', '') p = nestedExpr().parseString(n).asList() s = sort_newick(p) return str(s)
这让我屈服了
from collections import defaultdict all_newicks = itertools_newicks + enum_newicks d = defaultdict(list) for newick in all_newicks: d[canonical_newick(newick)].append(newick) for canonical, newicks in d.items(): print canonical for newick in newicks: print '\t', newick print
有22个键的字典：
[[[['A'], [['C'], ['B', 'D']]], ['E']]] ((A, (C, (B, D))),(E)); [[[['B'], [['A'], ['C', 'D']]], ['E']]] ((B, (A, (C, D))),(E)); [[[['B'], [['A', 'C'], ['D']]], ['E']]] ((B, ((A, C), D)),(E)); [[['A', 'C', 'D'], ['B'], ['E']]] ((B), (A, C, D),(E)); ((A, C, D), (B),(E)); ((A, D, C), (B),(E)); ((C, A, D), (B),(E)); ((C, D, A), (B),(E)); ((D, A, C), (B),(E)); ((D, C, A), (B),(E)); [[['A', 'B'], ['C', 'D'], ['E']]] ((A, B), (C, D),(E)); ((B, A), (C, D),(E)); ((C, D), (A, B),(E)); ((D, C), (A, B),(E)); [[[[['A'], ['B', 'C']], ['D']], ['E']]] (((A, (B, C)), D),(E)); [[[['A', 'C'], ['B', 'D']], ['E']]] (((A, C), (B, D)),(E)); [[['A'], ['B', 'C', 'D'], ['E']]] ((A), (B, C, D),(E)); ((B, C, D), (A),(E)); ((B, D, C), (A),(E)); ((C, B, D), (A),(E)); ((C, D, B), (A),(E)); ((D, B, C), (A),(E)); ((D, C, B), (A),(E)); [[[['A', 'D'], ['B', 'C']], ['E']]] (((B, C), (A, D)),(E)); [[['A', 'B', 'C'], ['D'], ['E']]] ((D), (A, B, C),(E)); ((A, B, C), (D),(E)); ((A, C, B), (D),(E)); ((B, A, C), (D),(E)); ((B, C, A), (D),(E)); ((C, A, B), (D),(E)); ((C, B, A), (D),(E)); [[['A', 'C'], ['B', 'D'], ['E']]] ((A, C), (B, D),(E)); ((B, D), (A, C),(E)); ((C, A), (B, D),(E)); ((D, B), (A, C),(E)); [[['A', 'B', 'D'], ['C'], ['E']]] ((C), (A, B, D),(E)); ((A, B, D), (C),(E)); ((A, D, B), (C),(E)); ((B, A, D), (C),(E)); ((B, D, A), (C),(E)); ((D, A, B), (C),(E)); ((D, B, A), (C),(E)); [[[['A'], [['B'], ['C', 'D']]], ['E']]] ((A, (B, (C, D))),(E)); [[[[['A', 'B'], ['C']], ['D']], ['E']]] ((((A, B), C), D),(E)); [[[[['B'], ['A', 'C']], ['D']], ['E']]] (((B, (A, C)), D),(E)); [[[['C'], [['B'], ['A', 'D']]], ['E']]] ((C, (B, (A, D))),(E)); [[[['C'], [['A', 'B'], ['D']]], ['E']]] ((C, ((A, B), D)),(E)); [[[['A'], [['B', 'C'], ['D']]], ['E']]] ((A, ((B, C), D)),(E)); [[[['A', 'B'], ['C', 'D']], ['E']]] (((A, B), (C, D)),(E)); [[[['B'], [['C'], ['A', 'D']]], ['E']]] ((B, (C, (A, D))),(E)); [[[['C'], [['A'], ['B', 'D']]], ['E']]] ((C, (A, (B, D))),(E)); [[['A', 'D'], ['B', 'C'], ['E']]] ((A, D), (B, C),(E)); ((B, C), (A, D),(E)); ((C, B), (A, D),(E)); ((D, A), (B, C),(E));
但仔细检查发现了一些问题。让我们看看newick '(((A, B), (C, D)),(E));和((D, C), (A, B),(E));的例子。在我们的字典d中，它们具有不同的规范密钥，分别是[[[['A', 'B'], ['C', 'D']], ['E']]]和{}。但事实上，这些是重复的树。我们可以通过观察两棵树之间的Robinson-Foulds distance来证实这一点。如果它是零，树是相同的。在
我们使用ete3 toolkit package中的robinson_foulds函数
from ete3 import Tree tree1 = Tree('(((A, B), (C, D)),(E));') tree2 = Tree('((D, C), (A, B),(E));') rf, max_parts, common_attrs, edges1, edges2, discard_t1, discard_t2 = tree1.robinson_foulds(tree2, unrooted_trees=True) print rf # returns 0
好的，那么Robinson Foulds是一个更好的方法来检查newick树的等式，而不是我的标准树方法。让我们将所有newick字符串包装在一个自定义的MyTree对象中，其中相等被定义为Robinson-Foulds距离为零：
class MyTree(Tree): def __init__(self, *args, **kwargs): super(MyTree, self).__init__(*args, **kwargs) def __eq__(self, other): rf = self.robinson_foulds(other, unrooted_trees=True) return not bool(rf[0]) trees = [MyTree(newick) for newick in all_newicks]
如果我们还可以定义一个__hash__()函数，为重复树返回相同的值，那么set(trees)将自动删除所有重复项。在
不幸的是，I haven't been able to find a good way to define ^{}，但是有了__eq__，我可以make use of ^{}：
unique_trees = [trees[i] for i in range(len(trees)) if i == trees.index(trees[i])] unique_newicks = [tree.write(format=9) for tree in unique_trees] for unique_newick in unique_newicks: print unique_newick
所以，我们的旅程到此结束了。我不能完全证明这是正确的解决方案，但我很有信心，以下19个newick都是可能的不同排列：
((A),(B,C,D),(E)); ((B),(A,C,D),(E)); ((C),(A,B,D),(E)); ((D),(A,B,C),(E)); ((A,B),(C,D),(E)); ((A,C),(B,D),(E)); ((A,D),(B,C),(E)); ((A,(B,(C,D))),(E)); ((B,(A,(C,D))),(E)); ((B,((A,C),D)),(E)); ((B,(C,(A,D))),(E)); ((A,((B,C),D)),(E)); (((A,(B,C)),D),(E)); ((((A,B),C),D),(E)); (((B,(A,C)),D),(E)); ((A,(C,(B,D))),(E)); ((C,(A,(B,D))),(E)); ((C,((A,B),D)),(E)); ((C,(B,(A,D))),(E));
如果我们将每个newick与所有其他newick成对比较，就可以确认这个列表中不再有重复项了
from itertools import product for n1, n2 in product(unique_newicks, repeat=2): if n1 != n2: mt1 = MyTree(n1) mt2 = MyTree(n2) assert mt1 != mt2

作为叶子集合的递归集合的树

实现`trees`函数

生成集合的划分

`trees`函数

相关问题更多 >

编程相关推荐

热门问题

热门文章