Pattern replacement in Python

Posted 2024-05-15 23:34:20

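I am looking for alternative ways to clean up a tabular file that contains information between parentheses. This will be the first step of a pipeline, and I need to remove every value inside parentheses (including the parentheses themselves).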

What I have:

> Otu00467  Bacteria(100);Gracilibacteria(99);unclassified(99);unclassified(99);unclassified(99);unclassified(99);
> Otu00469  Bacteria(100);Proteobacteria(96);unclassified(96);unclassified(96);unclassified(96);unclassified(96);
> Otu00470  Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);

What I want:

 Otu00467   Bacteria;Gracilibacteria;unclassified;unclassified;unclassified;unclassified;
 Otu00469   Bacteria;Proteobacteria;unclassified;unclassified;unclassified;unclassified;
 Otu00470   Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Azospirillum;

My first approach was to split the second column on ";", "(" and ")" and then join everything back together. It works, but it is too ugly.

Thank you.


3 answers

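I would try a regexp. Something like:

import re
# Raw string so the backslashes reach the regex engine unchanged.
pattern = re.compile(r'(\w+)\(\d+\);')
';'.join(re.findall(pattern, string))

for each string.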
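
Applied to a whole line, `findall` returns only the names in the second column, so the ID in the first column and the trailing `;` have to be put back by hand. A minimal sketch of that, assuming the two columns are separated by whitespace; the helper name `clean_line` and the tab used to rejoin the columns are illustrative choices, not part of the answer:

```python
import re

# A name made of word characters, followed by "(digits);".
PATTERN = re.compile(r'(\w+)\(\d+\);')

def clean_line(line):
    """Return the line with the parenthesized numbers dropped from column two."""
    # Assumption: the ID and the taxonomy string are separated by whitespace.
    otu_id, taxonomy = line.lstrip('> ').split(None, 1)
    names = PATTERN.findall(taxonomy)
    # ';'.join() puts no separator after the last name, so re-add the final ';'.
    return otu_id + '\t' + ';'.join(names) + ';'

line = '> Otu00467  Bacteria(100);Gracilibacteria(99);unclassified(99);'
print(clean_line(line))
```

Because the pattern relies on `\w+`, a name containing a hyphen or a space would be truncated to its last run of word characters; the substitution-based answers do not have that limitation.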

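This regular expression removes the parenthesized groups of digits, and it also removes any '>' characters, since it looks like you want to get rid of those as well.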

import re

data = '''\
> Otu00467  Bacteria(100);Gracilibacteria(99);unclassified(99);>unclassified(99);unclassified(99);unclassified(99);
> Otu00469  Bacteria(100);Proteobacteria(96);unclassified(96);unclassified(96);unclassified(96);unclassified(96);
> Otu00470  Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);
'''

data = re.sub(r'>|\(\d+\)', '', data)
print(data)

Output

 Otu00467  Bacteria;Gracilibacteria;unclassified;unclassified;unclassified;unclassified;
 Otu00469  Bacteria;Proteobacteria;unclassified;unclassified;unclassified;unclassified;
 Otu00470  Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Azospirillum;

This code works on both Python 2 and Python 3.
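
Since the question describes this as the first step of a pipeline, the same substitution can also be applied one line at a time instead of to the whole text at once. A rough sketch, written for Python 3; the generator name `clean_lines` is hypothetical, and `io.StringIO` merely stands in for an open file handle:

```python
import io
import re

# Same pattern as above: a literal '>' or a parenthesized run of digits.
PAREN_NUMBER = re.compile(r'>|\(\d+\)')

def clean_lines(lines):
    """Yield each input line with '>' and '(digits)' removed."""
    for line in lines:
        yield PAREN_NUMBER.sub('', line)

# Stand-in for a real file object; any iterable of lines works.
sample = io.StringIO('Otu00469\tBacteria(100);Proteobacteria(96);\n')
for cleaned in clean_lines(sample):
    print(cleaned, end='')
```

Processing line by line keeps memory use flat for large tables, at the cost of one `sub` call per line.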

import re
new_string = re.sub(r'\(.*?\)', '', your_string)
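
The non-greedy `.*?` in this pattern matters, and the pattern is broader than `\(\d+\)`: it removes any parenthesized text, not only numbers. The comparison below uses a made-up input string (not taken from the question's data) to show the difference:

```python
import re

# Hypothetical input containing a non-numeric parenthesized group.
s = 'Bacteria(100);Candidatus(genus)(87);'

# Non-greedy: each match stops at the nearest ')', so every group is removed.
non_greedy = re.sub(r'\(.*?\)', '', s)
# Digits only: '(genus)' is left in place.
digits_only = re.sub(r'\(\d+\)', '', s)
# Greedy: one match runs from the first '(' to the last ')' on the line.
greedy = re.sub(r'\(.*\)', '', s)

print(non_greedy)   # Bacteria;Candidatus;
print(digits_only)  # Bacteria;Candidatus(genus);
print(greedy)       # Bacteria;
```

Which variant is appropriate depends on whether parentheses can ever hold something that should be kept.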
