Extracting columns from CSV-formatted BLAST output with Python

# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 1 hits found
Cryptocephalus ctg7180000094003 79.59 637 110 9 38 655 1300 1935 1.00E-125 444
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 4 hits found
Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051
Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780
Cryptocephalus ctg7180000094015 81.26 667 105 13 7 654 1269 1934 2.00E-152 532
Cryptocephalus ctg7180000093818 78.64 515 106 4 8 519 1270 1783 2.00E-94 340

2 answers

User

#1 · edited 2024-06-17 10:47:08

I managed to find an approach based on:

Python: split files using mutliple split delimiters

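import csv

# newline="" lets the csv module handle line endings itself
with open("SANDoubleSuperMatrix.csv", newline="") as csvfile:
    # guess the delimiter from the first 1 KB, then rewind
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)

    identity = []
    for line in reader:
        # skip blank rows and "#" comment rows, which have no identity column
        if len(line) < 3 or line[0].startswith("#"):
            continue
        identity.append(line[2])

print(identity)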

User

#2 · edited 2024-06-17 10:47:08

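The data file does not look like CSV. It contains comment lines, and its delimiter is not a single character but a run of spaces used for alignment.

Since your overall goal is

to count all the matches which have over 98% identity (the third column).

and the contents of the data file are well-formed, you can parse it as an ordinary text file: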

import re

with open('BLASToutput.csv') as f:
    # read the file line by line
    for line in f:
        # skip comments and blank lines
        if line.startswith('#') or not line.strip():
            continue
        # split fields on runs of whitespace
        fields = re.split(r'\s+', line.strip())
        # check if the 3rd field is greater than 98%
        if float(fields[2]) > 98:
            # output the matched line (it already ends with a newline)
            print(line, end='')
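
The loop above prints the matching lines, while the stated goal is to count them. A minimal sketch of the counting step follows, assuming the fields are whitespace-separated and the third column is the % identity, as in the sample from the question. The function name `count_high_identity` and its `threshold` parameter are made up for this illustration.

```python
def count_high_identity(lines, threshold=98.0):
    """Count hit lines whose third column (% identity) exceeds threshold."""
    count = 0
    for line in lines:
        fields = line.split()
        # skip blank lines, "#" comment lines, and rows too short to hold the column
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        if float(fields[2]) > threshold:
            count += 1
    return count


# a few rows copied from the question's sample data
sample = [
    "# 4 hits found",
    "Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051",
    "Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780",
]

print(count_high_identity(sample))        # 0, no sample hit is above 98%
print(count_high_identity(sample, 95.0))  # 1, only the 95.5% hit
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example `count_high_identity(open('BLASToutput.csv'))`.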
