<p>You can use the <a href="https://github.com/biocore-ntnu/pyranges" rel="nofollow noreferrer">pyranges</a> library to parse the GTF/GFF file; each entry in the attribute column then becomes a separate column.</p>
<p>Installation:</p>
<pre><code># pip install pyranges
# or
# conda install -c bioconda pyranges
</code></pre>
<p>Example file:</p>
<p><code>ensembl.gtf</code>, an Ensembl GTF annotation file.</p>
<p>Using pyranges:</p>
<pre><code>import pyranges as pr
# as PyRanges-object
gr = pr.read_gtf("ensembl.gtf")
# +------------+----------+------------+---------+---------+----------+------------+----------+-----------------+--------------+------------+----------------+------------------------------------+-----------------+--------------------+-----+
# | Chromosome | Source   | Feature    | Start   | End     | Score    | Strand     | Frame    | gene_id         | gene_version | gene_name  | gene_source    | gene_biotype                       | transcript_id   | transcript_version | +13 |
# | (category) | (object) | (category) | (int32) | (int32) | (object) | (category) | (object) | (object)        | (object)     | (object)   | (object)       | (object)                           | (object)        | (object)           | ... |
# |------------+----------+------------+---------+---------+----------+------------+----------+-----------------+--------------+------------+----------------+------------------------------------+-----------------+--------------------+-----|
# | 1          | havana   | gene       | 11869   | 14409   | .        | +          | .        | ENSG00000223972 | 5            | DDX11L1    | havana         | transcribed_unprocessed_pseudogene | nan             | nan                | ... |
# | 1          | havana   | transcript | 11869   | 14409   | .        | +          | .        | ENSG00000223972 | 5            | DDX11L1    | havana         | transcribed_unprocessed_pseudogene | ENST00000456328 | 2                  | ... |
# | 1          | havana   | exon       | 11869   | 12227   | .        | +          | .        | ENSG00000223972 | 5            | DDX11L1    | havana         | transcribed_unprocessed_pseudogene | ENST00000456328 | 2                  | ... |
# | 1          | havana   | exon       | 12613   | 12721   | .        | +          | .        | ENSG00000223972 | 5            | DDX11L1    | havana         | transcribed_unprocessed_pseudogene | ENST00000456328 | 2                  | ... |
# | ...        | ...      | ...        | ...     | ...     | ...      | ...        | ...      | ...             | ...          | ...        | ...            | ...                                | ...             | ...                | ... |
# | 1          | ensembl  | transcript | 120725  | 133723  | .        | -          | .        | ENSG00000238009 | 6            | AL627309.1 | ensembl_havana | lincRNA                            | ENST00000610542 | 1                  | ... |
# | 1          | ensembl  | exon       | 133374  | 133723  | .        | -          | .        | ENSG00000238009 | 6            | AL627309.1 | ensembl_havana | lincRNA                            | ENST00000610542 | 1                  | ... |
# | 1          | ensembl  | exon       | 129055  | 129223  | .        | -          | .        | ENSG00000238009 | 6            | AL627309.1 | ensembl_havana | lincRNA                            | ENST00000610542 | 1                  | ... |
# | 1          | ensembl  | exon       | 120874  | 120932  | .        | -          | .        | ENSG00000238009 | 6            | AL627309.1 | ensembl_havana | lincRNA                            | ENST00000610542 | 1                  | ... |
# +------------+----------+------------+---------+---------+----------+------------+----------+-----------------+--------------+------------+----------------+------------------------------------+-----------------+--------------------+-----+
# Stranded PyRanges object has 95 rows and 28 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
# 13 hidden columns: transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, exon_number, exon_id, exon_version, (assigned, previous, ccds_id, protein_id, protein_version
# as DataFrame
df = gr.df
#     Chromosome   Source     Feature   Start     End  Score  Strand  Frame          gene_id  gene_version   gene_name  ...    transcript_biotype    tag  transcript_support_level  exon_number          exon_id  exon_version  (assigned  previous  ccds_id  protein_id  protein_version
# 0            1   havana        gene   11869   14409      .       +      .  ENSG00000223972             5     DDX11L1  ...                   NaN    NaN                       NaN          NaN              NaN           NaN        NaN       NaN      NaN         NaN              NaN
# 1            1   havana  transcript   11869   14409      .       +      .  ENSG00000223972             5     DDX11L1  ...  processed_transcript  basic                         1          NaN              NaN           NaN        NaN       NaN      NaN         NaN              NaN
# 2            1   havana        exon   11869   12227      .       +      .  ENSG00000223972             5     DDX11L1  ...  processed_transcript  basic                         1            1  ENSE00002234944             1        NaN       NaN      NaN         NaN              NaN
# 3            1   havana        exon   12613   12721      .       +      .  ENSG00000223972             5     DDX11L1  ...  processed_transcript  basic                         1            2  ENSE00003582793             1        NaN       NaN      NaN         NaN              NaN
# 4            1   havana        exon   13221   14409      .       +      .  ENSG00000223972             5     DDX11L1  ...  processed_transcript  basic                         1            3  ENSE00002312635             1        NaN       NaN      NaN         NaN              NaN
# ..         ...      ...         ...     ...     ...    ...     ...    ...              ...           ...         ...  ...                   ...    ...                       ...          ...              ...           ...        ...       ...      ...         ...              ...
# 90           1   havana        exon  110953  111357      .       -      .  ENSG00000238009             6  AL627309.1  ...               lincRNA    NaN                         5            3  ENSE00001879696             1        NaN       NaN      NaN         NaN              NaN
# 91           1  ensembl  transcript  120725  133723      .       -      .  ENSG00000238009             6  AL627309.1  ...               lincRNA  basic                         5          NaN              NaN           NaN        NaN       NaN      NaN         NaN              NaN
# 92           1  ensembl        exon  133374  133723      .       -      .  ENSG00000238009             6  AL627309.1  ...               lincRNA  basic                         5            1  ENSE00003748456             1        NaN       NaN      NaN         NaN              NaN
# 93           1  ensembl        exon  129055  129223      .       -      .  ENSG00000238009             6  AL627309.1  ...               lincRNA  basic                         5            2  ENSE00003734824             1        NaN       NaN      NaN         NaN              NaN
# 94           1  ensembl        exon  120874  120932      .       -      .  ENSG00000238009             6  AL627309.1  ...               lincRNA  basic                         5            3  ENSE00003740919             1        NaN       NaN      NaN         NaN              NaN
#
# [95 rows x 28 columns]
</code></pre>
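<p>To make "each attribute entry becomes its own column" concrete, here is a minimal standard-library sketch of that expansion for a single GTF line. It is only an illustration and not how pyranges is implemented: the helper names <code>parse_gtf_attributes</code> and <code>gtf_line_to_row</code> are hypothetical, and the sample line is rebuilt from the values of the first row shown above. It assumes no attribute value contains a semicolon, and a repeated key (such as <code>tag</code>) simply keeps its last value.</p>

```python
# Hypothetical helpers for illustration only; pyranges does this work for you.
FIXED_COLUMNS = ["Chromosome", "Source", "Feature", "Start", "End",
                 "Score", "Strand", "Frame"]


def parse_gtf_attributes(attr_field):
    """Turn 'key "value"; key "value";' into a {key: value} dict."""
    attrs = {}
    for entry in attr_field.strip().split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the trailing ';'
        key, _, value = entry.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def gtf_line_to_row(line):
    """Map the 8 fixed GTF fields, then add one key per attribute."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(FIXED_COLUMNS, fields[:8]))
    row.update(parse_gtf_attributes(fields[8]))
    return row


# Sample line assembled from the first row of the output above.
line = ('1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
        'gene_id "ENSG00000223972"; gene_version "5"; '
        'gene_name "DDX11L1"; gene_source "havana"; '
        'gene_biotype "transcribed_unprocessed_pseudogene";')

row = gtf_line_to_row(line)
print(row["gene_name"])  # DDX11L1
```

<p>Rows that lack a given attribute (for example <code>transcript_id</code> on a <code>gene</code> line) have no such key here, which corresponds to the <code>nan</code>/<code>NaN</code> cells in the tables above.</p>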