Splitting information in a specific column?

Type Variant_class ACC_NUM dbsnp genomic_coordinates_hg18 genomic_coordinates_hg19 HGVS_cdna HGVS_protein gene disease sequence_context_hg18 sequence_context_hg19 codon_change codon_number intron_number site location location_reference_point author journal vol page year pmid entrezid sift_score sift_prediction mutpred_score
1 DM CM920001 rs1800433 null chr12:9232351:- NM_000014.4 NP_000005.2:p.C972Y A2M Chronicobstructivepulmonarydisease null CACAAAATCTTCTCCAGATGCCCTATGGCT[G/A]TGGAGAGCAGAATATGGTCCTCTTTGCTCC TGT-TAT 972 null null 2 null Poller HUMGENET 88 313 1992 1370808 2 0 DAMAGING 0.594315245478036
1 DM CM004784 rs74315453 null chr22:43089410:- NM_017436.4 NP_059132.1:p.M183K A4GALT Pksynthasedeficiency(pphenotype) null TGCTCTCCGACGCCTCCAGGATCGCACTCA[T/A]GTGGAAGTTCGGCGGCATCTACCTGGACAC ATG-AAG 183 null null 2 null Steffensen JBC 275 16723 2000 10747952 53947 0 DAMAGING 0.787878787878788
1 DM CM1210274 null null chr22:43089327:- NM_017436.4 NP_059132.1:p.Q211E A4GALT NORpolyagglutination null CTGCGGAACCTGACCAACGTGCTGGGCACC[C/G]AGTCCCGCTACGTCCTCAACGGCGCGTTCC CAG-GAG 211 null null null null Suchanowska JBC 287 38220 2012 22965229 53947 0.79 TOLERATED null

2 answers

User

#1 · Edited 2024-05-23 21:01:19

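Assuming what you are actually doing is parsing and reformatting a CSV file, Wayne Werner's csv module approach is probably the most robust way to solve this.

Alternatively, you could use re.sub from the re module. The exact regular expression depends on your data. For example, if the column is always 3 nucleotides, a -, and 3 more nucleotides, something like this may work: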

import re
line = re.sub(r'(?<=[ACTG]{3})-(?=[ACTG]{3})', '\t', line)

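The regex uses a lookbehind and a lookahead to replace only a - that sits between two groups of 3 nucleotides, so as long as that pattern does not occur anywhere else in the file, it should work well.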
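As a quick check of the pattern, here is a minimal sketch applied to a shortened line modeled on the sample data. The helper name `split_codon_change` and the abbreviated tab-separated record are assumptions made for illustration, not part of the original answer.

```python
import re

# Matches a '-' only when three nucleotides sit immediately before and after it.
CODON_DASH = re.compile(r'(?<=[ACTG]{3})-(?=[ACTG]{3})')

def split_codon_change(line):
    """Replace the dash inside a codon change such as 'TGT-TAT' with a tab."""
    return CODON_DASH.sub('\t', line)

# Shortened, tab-separated stand-in for one record (hypothetical subset of columns).
sample = 'chr12:9232351:-\tA2M\tTGT-TAT\t972'
result = split_codon_change(sample)
print(result.split('\t'))
```

The trailing dash in `chr12:9232351:-` is left alone because it is preceded by a colon rather than three nucleotides. Note that the lookarounds only inspect the three characters on each side, so longer runs such as `AAAA-TTTT` would match as well.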

Edit: changed to re.sub. For some reason the original code had put me in a split mindset!

User

#2 · Edited 2024-05-23 21:01:19

If you know the dash is always at index 13 of the line (counting from 0), just use a slice:

'{}\t{}'.format(line[:13], line[14:])

Or, if you know it will always be the first dash in the string, you can limit the split:

>>> x = 'this has - a few - dashes - in it'
>>> x.split('-', maxsplit=1)
['this has ', ' a few - dashes - in it']
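In the sample data an earlier field (`chr12:9232351:-`) already contains a dash, so a limited split on the whole line would cut in the wrong place. One way around that, sketched here under the assumption that the fields have already been separated, is to apply the limited split to the field value alone. The helper name `split_once` is hypothetical.

```python
def split_once(value, sep='-'):
    """Split value at the first sep only, always returning two pieces."""
    parts = value.split(sep, maxsplit=1)
    if len(parts) == 1:
        # No separator found (e.g. a 'null' field): pad with an empty string.
        parts.append('')
    return parts

# A codon_change value taken from the sample data.
print(split_once('TGT-TAT'))
```

Padding the result keeps the number of output columns constant even for rows whose field has no dash.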

If by "column" you mean that your data is a CSV file (tab-delimited files work the same way), then Python's csv module will help you:

import csv

with open('infile.txt', newline='') as f, open('outfile.txt', 'w', newline='') as outfile:
    reader = csv.reader(f, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader, None))  # Write out the header row
    for row in reader:
        # Note: Python lists begin with [0],
        #       so the 13th column will have an index of 12
        row[12] = row[12].replace('-', ' ')
        writer.writerow(row)
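The snippet above replaces the dash with a space inside the same field. If the goal is two separate columns instead, one possible variation is sketched below. The function name `split_column`, the `_from`/`_to` header suffixes, and the small in-memory sample are all assumptions made for illustration.

```python
import csv
import io

def split_column(infile, outfile, index=12, sep='-'):
    """Copy tab-delimited rows, splitting the field at `index` into two fields."""
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to write
    name = header[index]
    # Hypothetical names for the two new header cells.
    header[index:index + 1] = [name + '_from', name + '_to']
    writer.writerow(header)
    for row in reader:
        # partition always returns three parts, even when sep is absent.
        left, _, right = row[index].partition(sep)
        row[index:index + 1] = [left, right]
        writer.writerow(row)

# Tiny in-memory example; index=1 because this demo has only three columns.
src = io.StringIO('gene\tcodon_change\tcodon_number\nA2M\tTGT-TAT\t972\n')
dst = io.StringIO()
split_column(src, dst, index=1)
print(dst.getvalue())
```

For the real file, the same function could be called with two file objects opened with `newline=''` and the default `index=12` for the 13th column.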
