Python: How do I write the correct chromosome name to the output?

chr4:154742507-154742714
                          CCCAGGCTGG
151 AGTCTTGCTTTTTTTGTCGTTGCCCAGGCTGGAGTGCAGTGGCACCATCTCGGCTCAC
chr9:47303792-47303999
   CCAGCCTGGG
1 TCCAGCCTGGGTGACAGCGTGAGGCTCTTGTCTCAAATAGAAAAAAAACAAAGAACAAAAAACAAAAAACCACCA

import re  # regular expressions, not needed (alternatives: the `split` method) but convenient

result = []
output_file = open('output.bed', 'w')
with open('Input.txt') as f:
    for line in f:
        if line.startswith('chr'):
            label = line.strip()
        elif line[0] == ' ':
            # short sequence
            length = len(line.strip())
            # find the index of the beginning of the short sequence
            for i, c in enumerate(line):
                if c.isalpha():
                    short_index = i
                    break
        elif line[0].isdigit():
            # long sequence
            n = line.split(' ')[0]
            # find the index of the beginning of the long sequence
            for i, c in enumerate(line):
                if c.isalpha():
                    long_index = i
                    break
            start = int(n) + short_index - long_index
            start -= 1
            end = start + length
            result.append('{} {} {}'.format(label, start, end))
            offset, n, start, length = 0, 0, 0, 0
output_line = "\n".join(result)
output_file.write(output_line)
output_file.close()

output_file = open('last_output.bed', 'w')
with open('output.bed') as fin:
    for line in fin:
        start, _, offset_start, offset_end = re.search(r'[^:]*:(\d+)\D+(\d+)\D+(\d+)\D+(\d+)', line).groups()
        output_line=('chr1\t{}\t{}\n'.format(int(start) + int(offset_start) + 1,int(start) + int(offset_end) + 1))
        output_file.write(output_line)
output_file.close()

1 answer

User
#1 · Posted 2024-04-24 08:53:29

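If I understand the question correctly, the only problem is that the wrong chromosome number (chr##) is written to the output.
The cause is fairly plain. At the end of your code, you hard-code it:
output_line=('chr1\t{}\t{}\n'.format(stuff))
If you don't want the output to always say chr1, you need to change that.
The regular expression on the preceding line already matches the chromosome name in the file; it just doesn't capture it in a group for later use. Try: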
chromosome, start, _, offset_start, offset_end = re.search(r'([^:]*):(\d+)\D+(\d+)\D+(\d+)\D+(\d+)', line).groups()
output_line=('{}\t{}\t{}\n'.format(chromosome, int(start) + int(offset_start) + 1,int(start) + int(offset_end) + 1))
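To check the capture group in isolation, here is a small self-contained sketch. The helper name `to_bed_line` is hypothetical, and the sample line only imitates the space-separated intermediate format written to output.bed; its offsets (172 and 182) are illustrative values, not results taken from your data.

```python
import re

# Same pattern as in the fix, with the chromosome name captured as group 1.
PATTERN = re.compile(r'([^:]*):(\d+)\D+(\d+)\D+(\d+)\D+(\d+)')


def to_bed_line(line):
    """Convert one intermediate line into a tab-separated BED line.

    Hypothetical helper, not part of the original script.
    """
    match = PATTERN.search(line)
    if match is None:
        raise ValueError('unexpected line format: {!r}'.format(line))
    # The region end (third field) is matched but not used.
    chromosome, start, _, offset_start, offset_end = match.groups()
    return '{}\t{}\t{}\n'.format(
        chromosome,
        int(start) + int(offset_start) + 1,
        int(start) + int(offset_end) + 1,
    )


# Illustrative line in the intermediate format: "label offset_start offset_end".
sample = 'chr4:154742507-154742714 172 182'
print(to_bed_line(sample), end='')
```

The printed line starts with chr4 rather than chr1, because the name now comes from the first capture group instead of a literal.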
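This approach is still fairly ugly, but it should work. Note that it would be much easier to produce the right output in the initial loop, rather than writing an intermediate format and then having to re-parse it.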
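A minimal sketch of that single-pass idea follows, assuming the input has the shape your code expects: a `chr...:start-end` label line, then a short-sequence line beginning with a space, then a long-sequence line beginning with a number. The names `convert`, `first_alpha_index`, and `LABEL` are hypothetical, and the sample lines are made up for illustration. The arithmetic is copied from your two loops, so it is only as correct as they are.

```python
import re

# Label lines look like "chr4:154742507-154742714"; capture name, start, end.
LABEL = re.compile(r'([^:]+):(\d+)-(\d+)')


def first_alpha_index(text):
    """Return the index of the first alphabetic character in text, or -1."""
    for i, c in enumerate(text):
        if c.isalpha():
            return i
    return -1


def convert(lines):
    """Yield tab-separated BED lines directly, with no intermediate file.

    Hypothetical helper that merges the two passes of the original script.
    """
    chromosome, region_start = None, 0
    short_index, length = 0, 0
    for line in lines:
        if line.startswith('chr'):
            match = LABEL.match(line.strip())
            if match is None:
                raise ValueError('unexpected label: {!r}'.format(line))
            chromosome = match.group(1)
            region_start = int(match.group(2))
        elif line[:1] == ' ':
            # Short sequence: remember its length and starting column.
            length = len(line.strip())
            short_index = first_alpha_index(line)
        elif line[:1].isdigit():
            # Long sequence: the leading number plus the column difference
            # gives the offset, exactly as in the original first loop.
            n = int(line.split(' ')[0])
            long_index = first_alpha_index(line)
            offset_start = n + short_index - long_index - 1
            offset_end = offset_start + length
            # Same "+ 1" adjustment as the original second loop.
            yield '{}\t{}\t{}\n'.format(
                chromosome,
                region_start + offset_start + 1,
                region_start + offset_end + 1,
            )


# Made-up input for illustration; the short sequence sits under its match.
sample = ['chr7:1000-2000\n', '    CCGG\n', '3 AACCGGTT\n']
print(''.join(convert(sample)), end='')
```

With real files, opening Input.txt for reading and last_output.bed for writing in one `with` statement and calling `fout.writelines(convert(fin))` would replace both passes, and the chromosome name would follow each label line automatically.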
