如何为给定外群的一组物种生成所有可能的Newick树排列?

2024-05-14 06:46:58 发布

您现在位置:Python中文网/ 问答频道 /正文

如何为给定外群的一组物种生成所有可能的Newick树排列?在

对于那些不知道Newick树格式是什么的人,可以在以下网站上找到一个很好的描述: https://en.wikipedia.org/wiki/Newick_format

我想为给定一个外群的一组物种创建所有可能的Newick树排列。我希望处理的叶节点数很可能是4、5或6个叶节点。在

“软”和“硬”多原子都是允许的。 https://en.wikipedia.org/wiki/Polytomy#Soft_polytomies_vs._hard_polytomieshttps://biology.stackexchange.com/questions/23667/evidence-discussions-of-hard-polytomy

下面显示的是理想的输出,将“E”设置为outgroup

理想输出:

((("A","B","C"),("D"),("E"));
((("A","B","D"),("C"),("E"));
((("A","C","D"),("B"),("E"));
((("B","C","D"),("A"),("E"));
((("A","B")("C","D"),("E"));
((("A","C")("B","D"),("E"));
((("B","C")("A","D"),("E"));
(("A","B","C","D"),("E"));
(((("A","B"),"C"),"D"),("E"));

但是,我使用itertools提供的任何可能的解决方案,特别是itertools.排列,遇到了等效输出的问题。我想到的最后一个想法涉及到如下所示的等效输出。在

等效输出:

^{pr2}$

这是我的解决方案的开始。不过,除了itertools之外,我现在还不知道该怎么解决这个问题。在

import itertools

def Newick_Permutation_Generator(list_of_species, name_of_outgroup)
    permutations_list =(list(itertools.permutations(["A","B","C","D","E"])))

    for given_permutation in permutations_list:
        process(given_permutation)

Newick_Permutation_Generator(["A","B","C","D","E"], "E")

Tags: ofhttpsorg节点物种wikiwikipedialist
2条回答

作为叶子集合的递归集合的树

让我们暂时把newick表示放在一边,考虑一下这个问题的一个可能的python表示。在

根树可以看作是(set of(sets of(sets of…))叶的递归层次结构。集合是无序的,这非常适合于描述树中的类:{{{"A", "B"}, {"C", "D"}}, "E"}应该与{}相同。在

如果我们考虑叶的初始集{"A", "B", "C", "D", "E"},那么以“E”为外群的树是形式为{tree, "E"}的集合,其中tree来自可以从叶子集{}构建的树集。我们可以尝试编写一个递归的trees函数来生成这组树,我们的全部树集将表示如下:

{{tree, "E"} for tree in trees({"A", "B", "C", "D"})}

(这里,我使用set comprehension表示法。)

实际上,python不允许集合集合,因为集合的元素必须是“散列的”(也就是说,python必须能够计算对象的一些“散列”值,以便能够检查它们是否属于集合)。python集合没有这个属性。幸运的是,我们可以使用一个类似的名为^{}的数据结构,它的行为非常类似于一个集合,但是不能被修改并且是“散列”的。因此,我们的一组树将是:

^{pr2}$

实现trees函数

现在让我们关注trees函数。在

对于叶集的每个可能的分区(分解为一组不相交的子集,包括所有元素),我们需要为分区的每个部分找到所有可能的树(通过递归调用)。对于一个给定的分区,我们将为每个可能的子树组合生成一个树。在

例如,如果一个分区是{"A", {"B", "C", "D"}},我们将考虑所有可以由部分"A"生成的树(实际上,只是叶"A"本身),以及可以由部分{"B", "C", "D"}(即trees({"B", "C", "D"}))生成的所有可能的树。然后,通过获取所有可能的对,其中一个元素来自"A",另一个元素来自trees({"B", "C", "D"}),就可以得到这个分区的可能树。在

这可以推广到包含两个以上部分的分区,并且itertools中的product函数在这里似乎很有用。在

因此,我们需要一种方法来生成一组叶子的可能分区。在

生成集合的划分

在这里,我根据this solution创建了一个partitions_of_set函数:

# According to https://stackoverflow.com/a/30134039/1878788:
# The problem is solved recursively:
# If you already have a partition of n-1 elements, how do you use it to partition n elements?
# Either place the n'th element in one of the existing subsets, or add it as a new, singleton subset.
def partitions_of_set(s):
    if len(s) == 1:
        yield frozenset(s)
        return
    # Extract one element from the set
    # https://stackoverflow.com/a/43804050/1878788
    elem, *_ = s
    rest = frozenset(s - {elem})
    for partition in partitions_of_set(rest):
        for subset in partition:
            # Insert the element in the subset
            try:
                augmented_subset = frozenset(subset | frozenset({elem}))
            except TypeError:
                # subset is actually an atomic element
                augmented_subset = frozenset({subset} | frozenset({elem}))
            yield frozenset({augmented_subset}) | (partition - {subset})
        # Case with the element in its own extra subset
        yield frozenset({elem}) | partition

为了检查获得的分区,我们创建了一个函数,使它们更易于显示(这对于以后生成树的newick表示也很有用):

def print_set(f):
    if type(f) not in (set, frozenset):
        return str(f)
    return "(" + ",".join(sorted(map(print_set, f))) + ")"

我们测试分区是否有效:

for partition in partitions_of_set({"A", "B", "C", "D"}):
    print(len(partition), print_set(partition))

输出:

1 ((A,B,C,D))
2 ((A,B,D),C)
2 ((A,C),(B,D))
2 ((B,C,D),A)
3 ((B,D),A,C)
2 ((A,B,C),D)
2 ((A,B),(C,D))
3 ((A,B),C,D)
2 ((A,D),(B,C))
2 ((A,C,D),B)
3 ((A,D),B,C)
3 ((A,C),B,D)
3 ((B,C),A,D)
3 ((C,D),A,B)
4 (A,B,C,D)

trees函数

的实际代码

现在我们可以编写tree函数:

from itertools import product
def trees(leaves):
    if type(leaves) not in (set, frozenset):
        # It actually is a single leaf
        yield leaves
        # Don't try to yield any more trees
        return
    # Otherwise, we will have to consider all the possible
    # partitions of the set of leaves, and for each partition,
    # construct the possible trees for each part
    for partition in partitions_of_set(leaves):
        # We need to skip the case where the partition
        # has only one subset (the initial set itself),
        # otherwise we will try to build an infinite
        # succession of nodes with just one subtree
        if len(partition) == 1:
            part, *_ = partition
            # Just to be sure the assumption is correct
            assert part == leaves
            continue
        # We recursively apply *tree* to each part
        # and obtain the possible trees by making
        # the product of the sets of possible subtrees.
        for subtree in product(*map(trees, partition)):
            # Using a frozenset guarantees
            # that there will be no duplicates
            yield frozenset(subtree)

测试:

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})

for tree in all_trees:
    print(print_set(tree) + ";")

输出:

(((B,C),A,D),E);
((((A,B),D),C),E);
((((B,D),A),C),E);
((((C,D),A),B),E);
(((A,D),B,C),E);
((A,B,C,D),E);
((((B,D),C),A),E);
(((A,B,C),D),E);
((((A,C),B),D),E);
((((C,D),B),A),E);
((((B,C),A),D),E);
(((A,B),C,D),E);
(((A,C),(B,D)),E);
(((B,D),A,C),E);
(((C,D),A,B),E);
((((A,B),C),D),E);
((((A,C),D),B),E);
(((A,C,D),B),E);
(((A,D),(B,C)),E);
((((A,D),C),B),E);
((((B,C),D),A),E);
(((A,B),(C,D)),E);
(((A,B,D),C),E);
((((A,D),B),C),E);
(((A,C),B,D),E);
(((B,C,D),A),E);

我希望结果是正确的。在

这种方法有点棘手。我花了一些时间来找出如何避免无限递归(当分区是{{"A", "B", "C", "D"}}时就会发生这种情况)。在

这是个很难回答的问题!这是我的旅程。在

第一个观察是outgroup始终是一个连接到newick字符串末尾的节点。让我们把其余的物种称为内群,并尝试生成这些物种的所有排列。然后简单地添加outgroup。在

from itertools import permutations

def ingroup_generator(species, n):
    for perm in permutations(species, n):
        yield tuple([tuple(perm), tuple(s for s in species if s not in perm)])

def format_newick(s, outgroup=''):
    return '(' + ', '.join('({})'.format(', '.join(p)) for p in s) + ',({}));'.format(outgroup)

species = ["A","B","C","D","E"]
outgroup = "E"
ingroup = [s for s in species if s != outgroup]

itertools_newicks= []
for n in range(1, len(ingroup)):
    for p in ingroup_generator(ingroup, n):
        itertools_newicks.append(format_newick(p, outgroup))

for newick in itertools_newicks:
    print newick

这将返回40个newick字符串:

^{pr2}$

其中有些是重复的,但我们稍后将删除这些重复项。在

作为bli noted in the comments(((("A","B"),"C"),"D"),("E"));及其变体也应被视为有效解。 comments on BioStar向我指出了正确的方向,这与生成二叉树的所有可能的分组是一样的。我找到了一个漂亮的Python implementation in this StackOverflow answer by rici

# A very simple representation for Nodes. Leaves are anything which is not a Node.
class Node(object):
  def __init__(self, left, right):
    self.left = left
    self.right = right

  def __repr__(self):
    return '(%s, %s)' % (self.left, self.right)

# Given a tree and a label, yields every possible augmentation of the tree by
# adding a new node with the label as a child "above" some existing Node or Leaf.
def add_leaf(tree, label):
  yield Node(label, tree)
  if isinstance(tree, Node):
    for left in add_leaf(tree.left, label):
      yield Node(left, tree.right)
    for right in add_leaf(tree.right, label):
      yield Node(tree.left, right)

# Given a list of labels, yield each rooted, unordered full binary tree with
# the specified labels.
def enum_unordered(labels):
  if len(labels) == 1:
    yield labels[0]
  else:
    for tree in enum_unordered(labels[1:]):
      for new_tree in add_leaf(tree, labels[0]):
        yield new_tree

然后

enum_newicks= []
for t in enum_unordered(ingroup):
    enum_newicks.append('({},({}));'.format(t, outgroup))

for newick in enum_newicks:
    print newick

生成以下15个新图标:

((A, (B, (C, D))),(E));
(((A, B), (C, D)),(E));
((B, (A, (C, D))),(E));
((B, ((A, C), D)),(E));
((B, (C, (A, D))),(E));
((A, ((B, C), D)),(E));
(((A, (B, C)), D),(E));
((((A, B), C), D),(E));
(((B, (A, C)), D),(E));
(((B, C), (A, D)),(E));
((A, (C, (B, D))),(E));
(((A, C), (B, D)),(E));
((C, (A, (B, D))),(E));
((C, ((A, B), D)),(E));
((C, (B, (A, D))),(E));

所以现在我们已经有了40+15=55个可能的newick字符串,我们必须删除重复项。在

我尝试的第一个死胡同是为每个newick字符串创建一个规范化的表示,这样我就可以将它们用作字典中的键。其想法是递归地对所有节点中的字符串进行排序。但首先我必须捕获所有(嵌套)节点。我不能使用正则表达式,因为nested structures are by definition not regular。在

因此,我使用了pyparsing包,并提出了这个:

from pyparsing import nestedExpr 

def sort_newick(t):
    if isinstance(t, str):
        return sorted(t)
    else:
        if all(isinstance(c, str) for c in t):
            return sorted(t)
        if all(isinstance(l, list) for l in t):
            return [sort_newick(l) for l in sorted(t, key=lambda k: sorted(k))]
        else:
            return [sort_newick(l) for l in t]


def canonical_newick(n):
    n = n.replace(',', '')
    p = nestedExpr().parseString(n).asList()
    s = sort_newick(p)
    return str(s)

这让我屈服了

from collections import defaultdict

all_newicks = itertools_newicks + enum_newicks

d = defaultdict(list)
for newick in all_newicks:
    d[canonical_newick(newick)].append(newick)

for canonical, newicks in d.items():
    print canonical
    for newick in newicks:
        print '\t', newick
    print

有22个键的字典:

[[[['A'], [['C'], ['B', 'D']]], ['E']]]
    ((A, (C, (B, D))),(E));

[[[['B'], [['A'], ['C', 'D']]], ['E']]]
    ((B, (A, (C, D))),(E));

[[[['B'], [['A', 'C'], ['D']]], ['E']]]
    ((B, ((A, C), D)),(E));

[[['A', 'C', 'D'], ['B'], ['E']]]
    ((B), (A, C, D),(E));
    ((A, C, D), (B),(E));
    ((A, D, C), (B),(E));
    ((C, A, D), (B),(E));
    ((C, D, A), (B),(E));
    ((D, A, C), (B),(E));
    ((D, C, A), (B),(E));

[[['A', 'B'], ['C', 'D'], ['E']]]
    ((A, B), (C, D),(E));
    ((B, A), (C, D),(E));
    ((C, D), (A, B),(E));
    ((D, C), (A, B),(E));

[[[[['A'], ['B', 'C']], ['D']], ['E']]]
    (((A, (B, C)), D),(E));

[[[['A', 'C'], ['B', 'D']], ['E']]]
    (((A, C), (B, D)),(E));

[[['A'], ['B', 'C', 'D'], ['E']]]
    ((A), (B, C, D),(E));
    ((B, C, D), (A),(E));
    ((B, D, C), (A),(E));
    ((C, B, D), (A),(E));
    ((C, D, B), (A),(E));
    ((D, B, C), (A),(E));
    ((D, C, B), (A),(E));

[[[['A', 'D'], ['B', 'C']], ['E']]]
    (((B, C), (A, D)),(E));

[[['A', 'B', 'C'], ['D'], ['E']]]
    ((D), (A, B, C),(E));
    ((A, B, C), (D),(E));
    ((A, C, B), (D),(E));
    ((B, A, C), (D),(E));
    ((B, C, A), (D),(E));
    ((C, A, B), (D),(E));
    ((C, B, A), (D),(E));

[[['A', 'C'], ['B', 'D'], ['E']]]
    ((A, C), (B, D),(E));
    ((B, D), (A, C),(E));
    ((C, A), (B, D),(E));
    ((D, B), (A, C),(E));

[[['A', 'B', 'D'], ['C'], ['E']]]
    ((C), (A, B, D),(E));
    ((A, B, D), (C),(E));
    ((A, D, B), (C),(E));
    ((B, A, D), (C),(E));
    ((B, D, A), (C),(E));
    ((D, A, B), (C),(E));
    ((D, B, A), (C),(E));

[[[['A'], [['B'], ['C', 'D']]], ['E']]]
    ((A, (B, (C, D))),(E));

[[[[['A', 'B'], ['C']], ['D']], ['E']]]
    ((((A, B), C), D),(E));

[[[[['B'], ['A', 'C']], ['D']], ['E']]]
    (((B, (A, C)), D),(E));

[[[['C'], [['B'], ['A', 'D']]], ['E']]]
    ((C, (B, (A, D))),(E));

[[[['C'], [['A', 'B'], ['D']]], ['E']]]
    ((C, ((A, B), D)),(E));

[[[['A'], [['B', 'C'], ['D']]], ['E']]]
    ((A, ((B, C), D)),(E));

[[[['A', 'B'], ['C', 'D']], ['E']]]
    (((A, B), (C, D)),(E));

[[[['B'], [['C'], ['A', 'D']]], ['E']]]
    ((B, (C, (A, D))),(E));

[[[['C'], [['A'], ['B', 'D']]], ['E']]]
    ((C, (A, (B, D))),(E));

[[['A', 'D'], ['B', 'C'], ['E']]]
    ((A, D), (B, C),(E));
    ((B, C), (A, D),(E));
    ((C, B), (A, D),(E));
    ((D, A), (B, C),(E));

但仔细检查发现了一些问题。让我们看看newick '(((A, B), (C, D)),(E));((D, C), (A, B),(E));的例子。在我们的字典d中,它们具有不同的规范密钥,分别是[[[['A', 'B'], ['C', 'D']], ['E']]]和{}。但事实上,这些是重复的树。我们可以通过观察两棵树之间的Robinson-Foulds distance来证实这一点。如果它是零,树是相同的。在

我们使用ete3 toolkit package中的robinson_foulds函数

from ete3 import Tree

tree1 = Tree('(((A, B), (C, D)),(E));')
tree2 = Tree('((D, C), (A, B),(E));')

rf, max_parts, common_attrs, edges1, edges2, discard_t1, discard_t2 = tree1.robinson_foulds(tree2, unrooted_trees=True)
    print rf # returns 0

好的,那么Robinson Foulds是一个更好的方法来检查newick树的等式,而不是我的标准树方法。让我们将所有newick字符串包装在一个自定义的MyTree对象中,其中相等被定义为Robinson-Foulds距离为零:

class MyTree(Tree):

    def __init__(self, *args, **kwargs):
        super(MyTree, self).__init__(*args, **kwargs)

    def __eq__(self, other):
        rf = self.robinson_foulds(other, unrooted_trees=True)
        return not bool(rf[0])

trees = [MyTree(newick) for newick in all_newicks]

如果我们还可以定义一个__hash__()函数,为重复树返回相同的值,那么set(trees)将自动删除所有重复项。在

不幸的是,I haven't been able to find a good way to define ^{},但是有了__eq__,我可以make use of ^{}

unique_trees = [trees[i] for i in range(len(trees)) if i == trees.index(trees[i])]
unique_newicks = [tree.write(format=9) for tree in unique_trees]
for unique_newick in unique_newicks:
    print unique_newick

所以,我们的旅程到此结束了。我不能完全证明这是正确的解决方案,但我很有信心,以下19个newick都是可能的不同排列:

((A),(B,C,D),(E));
((B),(A,C,D),(E));
((C),(A,B,D),(E));
((D),(A,B,C),(E));
((A,B),(C,D),(E));
((A,C),(B,D),(E));
((A,D),(B,C),(E));
((A,(B,(C,D))),(E));
((B,(A,(C,D))),(E));
((B,((A,C),D)),(E));
((B,(C,(A,D))),(E));
((A,((B,C),D)),(E));
(((A,(B,C)),D),(E));
((((A,B),C),D),(E));
(((B,(A,C)),D),(E));
((A,(C,(B,D))),(E));
((C,(A,(B,D))),(E));
((C,((A,B),D)),(E));
((C,(B,(A,D))),(E));

如果我们将每个newick与所有其他newick成对比较,就可以确认这个列表中不再有重复项了

from itertools import product

for n1, n2 in product(unique_newicks, repeat=2):
    if n1 != n2:
        mt1 = MyTree(n1)
        mt2 = MyTree(n2)
        assert mt1 != mt2

相关问题 更多 >