Replacing strings in a pandas DataFrame

>>> gprs.head()
Out[362]:
             Rxn  rule
0      13DAMPPOX  HGNC:549 or HGNC:550 or HGNC:80
6   24_25VITD2Hm  HGNC:2602
8      25VITD2Hm  HGNC:16354 or (HGNC:249 and HGNC:250) or (HGNC:249 and HGNC:251) or (HGNC:250 and HGNC:251) or HGNC:252 or HGNC:253 or HGNC:255 or HGNC:256
...

for index, row in gprs.iterrows():
    row['rule']=row['rule'].replace(r'(', "")
    row['rule']=row['rule'].replace(r')', "")
    ruleGenes=re.split(" and | or ",(row['rule']))
    for gene in ruleGenes:
        if re.match("HGNC:HGNC:", gene):
            gene=gene[5:]
            try:
                gprs=gprs.replace(gene,translation[gene])
            except:
                print 'error in ', gene
        else:
            try:
                gprs=gprs.replace(gene,translation[gene])
            except:
                print 'error in ', gene

>>> gprs.head()
0      13DAMPPOX  HGNC:549 or HGNC:550 or HGNC:80
6   24_25VITD2Hm  0
7   24_25VITD3Hm  HGNC:16354 or (HGNC:249 and HGNC:250) or (HGNC:249 and HGNC:251) or (HGNC:250 and HGNC:251) or HGNC:252 or HGNC:253 or HGNC:255 or HGNC:256

2 answers

User

Answer 1 · edited 2024-04-27 13:29:44

Here is one way to evaluate whether the enzyme is expressed.

Code:

import re
RE_GENE_NAME = re.compile(r'(HGNC:[0-9]+)')

def calc_expressed(translation_table, rule_str):
    rule_expr = RE_GENE_NAME.sub(r'translation_table["\1"]', rule_str)
    return eval(rule_expr)

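How does it work?

The idea is to take a rule such as:

HGNC:253 or HGNC:549

and turn it into:

translation_table["HGNC:253"] or translation_table["HGNC:549"]

That is, every instance of a value like HGNC:1234 is changed to translation_table["HGNC:1234"].

This produces a string that is a valid Python expression, which can then be evaluated with eval().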
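To make the two steps concrete, here is a minimal sketch that prints the intermediate string before evaluating it. The two-entry `translation_table` is made-up sample data, and the names `rewritten` and `result` are illustrative, not part of the answer. The sketch passes the table to `eval()` explicitly, which is an assumption about how one might scope the lookup rather than what the answer's function does.

```python
import re

RE_GENE_NAME = re.compile(r'(HGNC:[0-9]+)')

# Hypothetical sample data, for illustration only.
translation_table = {'HGNC:253': 0, 'HGNC:549': 1}
rule_str = 'HGNC:253 or HGNC:549'

# Step 1: wrap every gene ID in a dictionary lookup.
rewritten = RE_GENE_NAME.sub(r'translation_table["\1"]', rule_str)
print(rewritten)

# Step 2: evaluate the rewritten string, exposing only the table by name.
result = eval(rewritten, {'translation_table': translation_table})
print(result)
```

Since `eval()` executes whatever expression it is given, this approach is only appropriate for rule strings from a trusted source.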

Test code:

translation = {
    'HGNC:80': 1,
    'HGNC:249': 1,
    'HGNC:250': 1,
    'HGNC:251': 0,
    'HGNC:252': 1,
    'HGNC:253': 0,
    'HGNC:255': 1,
    'HGNC:256': 1,
    'HGNC:549': 0,
    'HGNC:550': 1,
    'HGNC:2602': 0,
    'HGNC:16354': 1,
}

test_rules = (
    ('HGNC:550', 1),
    ('HGNC:2602', 0),
    ('HGNC:253 or HGNC:549', 0),
    ('HGNC:549 or HGNC:550 or HGNC:80', 1),
    ('HGNC:549 or (HGNC:550 and HGNC:2602)', 0),
    ('HGNC:549 or (HGNC:550 and HGNC:16354)', 1),
    ('HGNC:16354 or (HGNC:249 and HGNC:250) or (HGNC:249 and HGNC:251)', 1)
)

for rule, expected in test_rules:
    assert expected == calc_expressed(translation, rule)

User

Answer 2 · edited 2024-04-27 13:29:44

The translation can be substituted in with

>>> for_eval = {k+'(?![0-9])': str(v) for k, v in translation.items()}
>>> gprs['translation'] = gprs['rule'].replace(for_eval, regex=True)

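Explanation:

The first line

>>> for_eval = {k+'(?![0-9])': str(v) for k, v in translation.items()}

swaps the values 0 and 1 for their string forms '0' and '1', in preparation for inserting them into the strings in the second line. Appending '(?![0-9])' to each key is a negative lookahead: it rejects any match that is followed by more digits, which avoids matching only the first part of a longer key.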
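The effect of the lookahead can be seen with the standard `re` module alone. This is a small sketch with a made-up rule in which one ID (`HGNC:25`, hypothetical) happens to be a prefix of another:

```python
import re

# Hypothetical rule: 'HGNC:25' is a prefix of 'HGNC:250'.
rule = 'HGNC:250 and HGNC:25'

# Without the lookahead, the pattern also matches inside 'HGNC:250'.
plain = re.sub('HGNC:25', '1', rule)

# With the lookahead, a match followed by another digit is rejected.
guarded = re.sub('HGNC:25(?![0-9])', '1', rule)

print(plain)    # the longer ID has been corrupted
print(guarded)  # only the exact ID was replaced
```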

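The second line

>>> gprs['translation'] = gprs['rule'].replace(for_eval, regex=True)

performs the replacement as a column operation in pandas, rather than iterating over each row in Python, which is much slower for larger datasets (30 or more entries in this case).

Without regex=True, this would only work on full matches, which would lead to the same problem you ran into when trying to handle the longer rules.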
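The difference between the two modes can be sketched on a tiny Series. The two rules and the names `mapping`, `exact`, and `partial` below are illustrative only:

```python
import pandas as pd

# Two sample rules: one is a bare ID, the other combines two IDs.
rules = pd.Series(['HGNC:550', 'HGNC:253 or HGNC:549'])
mapping = {'HGNC:550': '1', 'HGNC:253': '0', 'HGNC:549': '0'}

# Default mode: a cell is replaced only if its whole value equals a key.
exact = rules.replace(mapping)

# Regex mode: each key is a pattern, so IDs inside longer rules are replaced too.
pattern_mapping = {k + '(?![0-9])': v for k, v in mapping.items()}
partial = rules.replace(pattern_mapping, regex=True)

print(exact.tolist())
print(partial.tolist())
```

In the default mode the second rule is left untouched, which mirrors the output in the question where only the single-gene rule was translated.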

For example, with credit to u/Stephen Rauch for the test cases:

In [3]: translation = {
    'HGNC:80': 1,
    'HGNC:249': 1,
    'HGNC:250': 1,
    'HGNC:251': 0,
    'HGNC:252': 1,
    'HGNC:253': 0,
    'HGNC:255': 1,
    'HGNC:256': 1,
    'HGNC:549': 0,
    'HGNC:550': 1,
    'HGNC:2602': 0,
    'HGNC:16354': 1,
}

In [4]: gprs = pd.DataFrame([
    ('HGNC:550', 1),
    ('HGNC:2602', 0),
    ('HGNC:253 or HGNC:549', 0),
    ('HGNC:549 or HGNC:550 or HGNC:80', 1),
    ('HGNC:549 or (HGNC:550 and HGNC:2602)', 0),
    ('HGNC:549 or (HGNC:550 and HGNC:16354)', 1),
    ('HGNC:16354 or (HGNC:249 and HGNC:250) or (HGNC:249 and HGNC:251)', 1)
], columns = ['rule', 'target'])

In [5]: for_eval = {k+'(?![0-9])': str(v) for k, v in translation.items()}

In [6]: gprs['translation'] = gprs['rule'].replace(for_eval, regex=True)

In [7]: gprs['translation']

Out[7]:
0                              1
1                              0
2                         0 or 0
3                    0 or 1 or 1
4                 0 or (1 and 0)
5                 0 or (1 and 1)
6    1 or (1 and 1) or (1 and 0)
Name: translation, dtype: object

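For the second part, eval, as mentioned and explained in u/Stephen Rauch's answer, can be used to evaluate the expressions contained in the resulting strings. To do this, pd.Series.map applies an element-wise operation to a Series, which is faster than using iterrows. Here, that looks like

In [10]: gprs['translation'].map(eval)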
Out[10]: 
0    1
1    0
2    0
3    1
4    0
5    1
6    1
Name: translation, dtype: int64

Alternatively, if you are trying to squeeze out the last bit of performance, you can use regex pattern matching on the output instead of map. This depends more heavily on how your rules are worded, but if they are all formatted like the three in your post (paired, parenthesized, and with no nesting), then

# set any 'and' term with a zero in it to zero
>>> ands = gprs['translation'].str.replace(r'0 and \d|\d and 0', '0', regex=True)
# if any ones remain, only 'or's and '1 and 1' statements are left
>>> ors = ands.replace('1', 1, regex=True)
# faster to force it to numeric than to search the remaining terms for zeros
>>> out = pd.to_numeric(ors, errors='coerce').fillna(0)
>>> out
0    1.0
1    0.0
2    0.0
3    1.0
4    0.0
5    1.0
6    1.0
Name: translation, dtype: float64

should be about 5 times faster over a few thousand rows (checked with the timeit module), with the break-even point at around 60 or 70 entries.
