Parsing a txt file in VCF format

Posted 2024-04-26 00:06:27


I want to extract information from a txt file into a dataframe with the following fields:

1) GENEINFO
2) ID
3) POS
4) ALT
5) CLNSIG
6) CLNDN 

The txt file is here

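I wrote the following code to try to get the information out of the file, but I don't know how to continue. Can you give me some ideas on how to do that?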

import io
import pandas as pd


def read_vcf(path):
    # Open the file passed in as `path` instead of a hard-coded file name
    with open(path, 'r') as f:
        # Drop the '##' metadata lines; a '#CHROM' header line is kept
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

1 answer
User
#1 · Posted 2024-04-26 00:06:27

You can read it with

df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')

After this you will have a table that contains the columns 2) ID, 3) POS, and 4) ALT

print(df[['ID', 'POS', 'ALT']].head())

This gives

       ID      POS ALT
0  475283  1014042   A
1  542074  1014122   T
2  183381  1014143   T
3  542075  1014179   T
4  475278  1014217   T
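
One caveat worth checking: in the standard VCF layout the header line itself starts with '#' ('#CHROM', which is what the rename in the question assumes), and comment='#' makes pandas skip every line that starts with '#', the header included. The output above suggests the header in clinvar_final.txt has no leading '#', but if yours does, something like the sketch below should keep it. The sample text and the helper name read_vcf_text are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Made-up VCF-style text for illustration (not rows from clinvar_final.txt)
SAMPLE = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t11\tG\tA\t.\t.\tGENEINFO=GENE1:1;CLNSIG=Benign\n"
    "1\t200\t22\tC\tT\t.\t.\tGENEINFO=GENE1:1;CLNSIG=Pathogenic\n"
)

# With comment='#', the '#CHROM' line is skipped as well, so the first
# data row ends up being used as the header.
naive = pd.read_csv(io.StringIO(SAMPLE), comment='#', sep='\t')
print('POS' in naive.columns)


def read_vcf_text(handle):
    """Hypothetical helper: drop '##' metadata lines, keep the '#CHROM' header."""
    lines = [line for line in handle if not line.startswith('##')]
    df = pd.read_csv(io.StringIO(''.join(lines)), sep='\t',
                     dtype={'#CHROM': str})
    # Strip the leading '#' from the first column name
    return df.rename(columns={'#CHROM': 'CHROM'})


df = read_vcf_text(io.StringIO(SAMPLE))
print(df.columns.tolist())
```

For a file on disk, the same helper could be called as read_vcf_text(open('clinvar_final.txt')), assuming the file follows that layout.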

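The other information (1) GENEINFO, 5) CLNSIG, 6) CLNDN) is stored as a single string in the INFO column. You can use a regex to pull each one out into a separate column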

df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')

print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())

Result

0    ISG15:9636
1    ISG15:9636
2    ISG15:9636
3    ISG15:9636
4    ISG15:9636
Name: GENEINFO, dtype: object

0                    Benign
1    Uncertain_significance
2                Pathogenic
3    Uncertain_significance
4                    Benign
Name: CLNSIG, dtype: object

0    Immunodeficiency_38_with_basal_ganglia_calcifi...
1    Immunodeficiency_38_with_basal_ganglia_calcifi...
2    Immunodeficiency_38_with_basal_ganglia_calcifi...
3    Immunodeficiency_38_with_basal_ganglia_calcifi...
4    Immunodeficiency_38_with_basal_ganglia_calcifi...
Name: CLNDN, dtype: object
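
If you end up needing more than three keys, an alternative to one regex per key is to split the whole INFO string once. This is only a sketch, assuming INFO follows the KEY=value;KEY=value layout seen above; the INFO strings and the helper name parse_info are made up for illustration.

```python
import pandas as pd

# Made-up INFO strings in the KEY=value;KEY=value layout
df = pd.DataFrame({'INFO': [
    'ALLELEID=1;CLNDN=Disease_A;CLNSIG=Benign;GENEINFO=GENE1:1',
    'ALLELEID=2;CLNSIG=Pathogenic;GENEINFO=GENE2:2',
]})


def parse_info(info):
    """Hypothetical helper: turn 'K1=v1;K2=v2' into a dict.

    Entries without '=' (flags) are stored with the value True.
    """
    result = {}
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        result[key] = value if sep else True
    return result


# One column per key; keys missing from a row become NaN
info_df = pd.DataFrame(df['INFO'].map(parse_info).tolist(), index=df.index)

wanted = ['GENEINFO', 'CLNSIG', 'CLNDN']
df = df.join(info_df[wanted])
print(df[wanted])
```

Rows that lack a key (the second row has no CLNDN here) get NaN in that column, which matches what str.extract does when the pattern is not found.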

Complete code for the answer:

import pandas as pd

df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')

print(df.columns)

print(df[['ID', 'POS', 'ALT']].head())

df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')

print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())
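
To end up with a single dataframe holding exactly the six fields from the question, the same steps can be wrapped up roughly as follows. This is a sketch on made-up sample text (the header has no leading '#', matching what the output above suggests); passing expand=False makes str.extract return a Series rather than a one-column DataFrame.

```python
import io
import pandas as pd

# Made-up tab-separated sample standing in for clinvar_final.txt
SAMPLE = (
    "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t11\tG\tA\t.\t.\tCLNDN=Disease_A;CLNSIG=Benign;GENEINFO=GENE1:1\n"
    "1\t200\t22\tC\tT\t.\t.\tCLNDN=Disease_B;CLNSIG=Pathogenic;GENEINFO=GENE2:2\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep='\t')

# Extract each requested key from INFO into its own column
for key in ('GENEINFO', 'CLNSIG', 'CLNDN'):
    df[key] = df['INFO'].str.extract(rf'{key}=([^;]*)', expand=False)

# Keep only the six fields, in the order listed in the question
result = df[['GENEINFO', 'ID', 'POS', 'ALT', 'CLNSIG', 'CLNDN']]
print(result)
```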
