Python iteration: sorting through a .txt file

Posted 2024-06-08 18:15:20


I have a sample inputfile.txt:

chr1    34870071    34899867    pi-Fam168b.1    -
chr11   98724946    98764609    pi-Wipf2.1  +
chr11   105898192   105920636   pi-Dcaf7.1  +
chr11   120486441   120495268   pi-Mafg.1   -
chr12   3891106 3914443 pi-Dnmt3a.1 +
chr12   82815946    82882157    pi-Map3k9.1 -
chr13   23855536    23856215    pi-Hist1h1a.1   +
chr13   55206682    55236190    pi-Zfp346.1 +
chr1    95700553    95718679    pi-Ing5.1   +
chr13   55313417    55419685    pi-Nsd1.1   +
chr14   27852218    27920472    pi-Il17rd.1 +
chr14   65430438    65568699    pi-Hmbox1.1 -
chr1    120524521   120581739   pi-Tfcp2l1.1    +
chr15   81633147    81657289    pi-Tef.1    +
chr15   89331804    89390691    pi-Shank3.1 +
chr15   103021983   103070259   pi-Cbx5.1   -
chr16   16896549    16927451    pi-Ppm1f.1  +
chr16   17233679    17263523    pi-Hic2.1   +
chr16   17452059    17486929    pi-Crkl.1   +
chr16   24393531    24992661    pi-Lpp.1    +
chr16   43964878    43979143    pi-Zdhhc23.1    -
chr17   25098236    25152532    pi-Cramp1l.1    -
chr17   27993451    28036985    pi-Uhrf1bp1.1   +
chr17   83973363    84031786    pi-Kcng3.1  -
chr1    133904194   133928161   pi-Elk4.1   +
chr18   60844148    60908308    pi-Ndst1.1  -
chr19   10057193    10059582    pi-Fth1.1   +
chr19   44637337    44650762    pi-Hif1an.1 +
chr1    135027714   135036359   pi-Ppp1r15b.1   +
chr2    28677821    28695861    pi-Gtf3c4.1 -
chr1    136651241   136852527   pi-Ppp1r12b.1   -
chr2    154262219   154365092   pi-Cbfa2t2.1    +
chr2    156022393   156135687   pi-Phf20.1  +
chr3    51028854    51055547    pi-Ccrn4l.1 +
chr3    94985683    95021902    pi-Gabpb2.1 -
chr1    158488203   158579750   pi-Abl2.1   +
chr4    45411294    45421633    pi-Mcart1.1 -
chr4    56879897    56960355    pi-D730040F13Rik.1  -
chr4    59818521    59917612    pi-Snx30.1  +
chr4    107847846   107890527   pi-Zyg11a.1 -
chr4    107900359   107973695   pi-Zyg11b.1 -
chr4    132195002   132280676   pi-Eya3.1   +
chr4    134968222   134989706   pi-Rcan3.1  -
chr4    136025678   136110697   pi-Luzp1.1  +
chr1    162933052   162964958   pi-Zbtb37.1 -
chr5    38591490    38611628    pi-Zbtb49.1 -
chr5    67783388    67819359    pi-Bend4.1  -
chr5    114387108   114443767   pi-Ssh1.1   -
chr5    115592990   115608225   pi-Mlec.1   -
chr5    143628624   143656891   pi-Fbxl18.1 -
chr1    172123561   172145541   pi-Uhmk1.1  -
chr6    83312367    83391602    pi-Tet3.1   -
chr6    85419571    85434653    pi-Fbxo41.1 -
chr6    116288039   116359551   pi-March08.1    +
chr6    120786229   120842859   pi-Bcl2l13.1    +
chr7    71031236    71083761    pi-Klf13.1  -
chr7    107068766   107128968   pi-Rnf169.1 -
chr7    139903770   140044311   pi-Fam53b.1 -
chr8    72285224    72298794    pi-Zfp866.1 -
chr8    106872110   106919708   pi-Cmtm4.1  -
chr8    112250549   112261649   pi-Atxn1l.1 -
chr10   41901651    41911816    pi-Foxo3.1  -
chr8    119682164   119739895   pi-Gan.1    +
chr8    125406988   125566154   pi-Ankrd11.1    -
chr9    27148219    27165314    pi-Igsf9b.1 +
chr9    44100521    44113717    pi-Hinfp.1  -
chr9    61761092    61762348    pi-Rplp1.1  -
chr9    106590412   106691503   pi-Rad54l2.1    -
chr9    114416339   114473487   pi-Trim71.1 -
chr9    119311403   119351032   pi-Acvr2b.1 +
chr9    119354082   119373348   pi-Exog.1   +
chr10   82822985    82831579    pi-D10Wsu102e.1 +
chr10   126415753   126437016   pi-Ctdsp2.1 +
chr1    90159688    90174093    pi-Hjurp.1  -
chr11   60591039    60597792    pi-Smcr8.1  +
chr11   69209318    69210176    pi-Lsmd1.1  +
chr11   75345218    75391069    pi-Slc43a2.1    +
chr11   79474214    79511524    pi-Rab11fip4.1  +
chr11   95818479    95868022    pi-Igf2bp1.1    -
chr11   97223641    97259855    pi-Socs7.1  +
chr11   97524530    97546757    pi-Mllt6.1  +
chr1    120355721   120355843   1-qE2.3-2.1 -
chr2    120518324   120540873   2-qE5-4.1   +
chr7    82913927    82926993    7-qD2-40.1  -

Column 1 = chromosome number

Column 2 = start

Column 3 = end

Column 4 = gene name

Column 5 = orientation (either + or -)

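1.) I need to extract the lines that have the same chromosome number (column 1), whose start sites (column 2) differ by at most 200, and whose orientations are opposite (one is + and the other is -).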

This is what I have so far, and I am not sure where my error is:

import csv
import itertools as it
f=open('inputfile.txt', 'r')

def getrecords(f):
    for line in open(f):
        yield line.strip().split()
key=lambda x: x[0]
for i, rec in it.groupby(sorted(getrecords('inputfile.txt'), key=key), key=key):
    for c0, c1 in it.combinations(rec, 2):
        if (c0[4]!= c1[4] and (abs(int(c0[1])-int(c1[1]))) < 200):
            print ("%s\t%s\t%s" % (c0[0], c0[1], c0[3]))
            print("%s\t%s\t%s" % (c1[0], c1[1], c1[3]))

Please note: this code runs but gives no output, even though I am sure there should be some. I expect about 15 unique lines. Expected output:

ChrX   start_number1            gene_name1
ChrX   start_number1+/-200      gene_name2
ChrY   start_number2            gene_name3
ChrY   start_number2+/-200      gene_name4

I will then sort these lines to remove duplicates.


1 answer
User
#1 · Posted 2024-06-08 18:15:20

None of the values in the example satisfy the specified conditions, so I added one line to inputfile.txt:

chr1    34870091    34899887    pi-Fam168b.1 +

I copied the first line of inputfile.txt, added 20 to the integers in the second and third columns, and flipped the strand to +.

First, you do not need to import csv, since you never use it. You should import groupby, product and itemgetter, which I explain below.

from itertools import groupby,product
from operator import itemgetter

This block simply parses inputfile.txt into a usable data structure (a list of dictionaries), where each record in the file becomes one dictionary in the sites list.

with open('/home/kevin/inputfile.txt', 'r') as f:  # with open() closes the file for you; text mode yields str rows
    sites = []  # list to hold each record as a dictionary
    for row in f:
        row = tuple(row.strip().split())  # split the line on whitespace
        d = {'chr': row[0], 'start': row[1], 'stop': row[2], 'gene_name': row[3], 'strand': row[4]}
        sites.append(d)
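As a hedged variation on the parsing step above, the sketch below converts the coordinates to integers once and skips blank or malformed lines. The helper name `parse_sites` and the in-memory sample are assumptions made for illustration; they are not part of the original answer.

```python
import io

def parse_sites(lines):
    """Parse whitespace-separated records into a list of dicts (hypothetical helper)."""
    sites = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:  # skip blank or malformed lines
            continue
        chrom, start, stop, gene_name, strand = fields
        sites.append({
            'chr': chrom,
            'start': int(start),  # convert once, so later comparisons need no int()
            'stop': int(stop),
            'gene_name': gene_name,
            'strand': strand,
        })
    return sites

# In-memory stand-in for inputfile.txt, including one blank line
sample = io.StringIO(
    "chr1    34870071    34899867    pi-Fam168b.1    -\n"
    "\n"
    "chr12   3891106 3914443 pi-Dnmt3a.1 +\n"
)
sites = parse_sites(sample)
print(len(sites), sites[0]['start'] + 20)  # prints: 2 34870091
```

Passing an open file object instead of the `StringIO` sample works the same way, because both yield lines when iterated.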

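I chose to sort by strand first, using itemgetter. Then, when you groupby strand, the dictionaries can be split into a list of all plus-strand records and a list of all minus-strand records: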

plus = []
minus = []

sites.sort(key=itemgetter('strand'))  # groupby only groups consecutive records, so sort first

for elmt, grp in groupby(sites, itemgetter('strand')):  # sites is our sorted list of dicts
    for item in grp:
        if elmt == '+':
            plus.append(item)
        else:
            minus.append(item)
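To illustrate why sorting matters for groupby, here is a small sketch with three made-up records (the reduced dictionaries are an assumption for brevity). `itertools.groupby` starts a new group whenever the key changes, so unsorted input can produce the same key more than once.

```python
from itertools import groupby
from operator import itemgetter

# Reduced records for illustration; only the fields needed here
sites = [
    {'gene_name': 'pi-Fam168b.1', 'strand': '-'},
    {'gene_name': 'pi-Wipf2.1', 'strand': '+'},
    {'gene_name': 'pi-Mafg.1', 'strand': '-'},
]

# Unsorted input: a new group begins at every change of key
unsorted_keys = [key for key, _ in groupby(sites, key=itemgetter('strand'))]

# Sorted input: exactly one group per distinct strand
by_strand = sorted(sites, key=itemgetter('strand'))
groups = {key: list(grp) for key, grp in groupby(by_strand, key=itemgetter('strand'))}

plus, minus = groups.get('+', []), groups.get('-', [])
print(unsorted_keys)           # ['-', '+', '-']
print(len(plus), len(minus))   # 1 2
```

Building the dictionary from unsorted input would silently keep only the last group for each key, which is the reason for sorting before grouping.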

Now you can use product to iterate over plus and minus, which works like a nested for loop, and compare the start positions:

for p, m in product(plus, minus):
    # same chromosome, and start positions less than 200 apart
    if p['chr'] == m['chr'] and abs(int(p['start']) - int(m['start'])) < 200:
        print("%s\t%s\t%s" % (p['chr'], p['start'], p['gene_name']))
        print("%s\t%s\t%s" % (m['chr'], m['start'], m['gene_name']))
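The product loop compares every plus record with every minus record. For larger files, one possible alternative (a sketch, not the answer's method) is to group records by chromosome, sort each group by start, and stop scanning as soon as the gap reaches the limit. The function name `close_opposite_pairs`, the `max_gap` parameter, and the integer `start` values are assumptions made for this example.

```python
from collections import defaultdict

def close_opposite_pairs(sites, max_gap=200):
    """Yield (a, b) pairs on the same chromosome with opposite strands
    whose start positions are less than max_gap apart."""
    by_chr = defaultdict(list)
    for site in sites:
        by_chr[site['chr']].append(site)
    for chrom_sites in by_chr.values():
        chrom_sites.sort(key=lambda s: s['start'])
        for i, a in enumerate(chrom_sites):
            for j in range(i + 1, len(chrom_sites)):
                b = chrom_sites[j]
                if b['start'] - a['start'] >= max_gap:
                    break  # sorted by start, so later records are even farther away
                if a['strand'] != b['strand']:
                    yield a, b

# Sample records; 'start' is assumed to be an int already
sites = [
    {'chr': 'chr1', 'start': 34870071, 'gene_name': 'pi-Fam168b.1', 'strand': '-'},
    {'chr': 'chr1', 'start': 34870091, 'gene_name': 'pi-Fam168b.1', 'strand': '+'},
    {'chr': 'chr1', 'start': 95700553, 'gene_name': 'pi-Ing5.1', 'strand': '+'},
    {'chr': 'chr2', 'start': 28677821, 'gene_name': 'pi-Gtf3c4.1', 'strand': '-'},
]
pairs = list(close_opposite_pairs(sites))
for a, b in pairs:
    print("%s\t%d\t%s" % (a['chr'], a['start'], a['gene_name']))
    print("%s\t%d\t%s" % (b['chr'], b['start'], b['gene_name']))
```

Each pair is reported once, with the record that has the smaller start position first, so the print order can differ from the product version.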

The result is:

chr1    34870091    pi-Fam168b.1 #remember I artificially added this one
chr1    34870071    pi-Fam168b.1
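The question mentions removing duplicate lines afterwards. A record that pairs with several partners would be printed more than once, so a minimal sketch of de-duplication is shown below. The helper `unique_lines` and the repeated third line are made up for illustration.

```python
def unique_lines(lines):
    """Return lines with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))

output = [
    "chr1\t34870091\tpi-Fam168b.1",
    "chr1\t34870071\tpi-Fam168b.1",
    "chr1\t34870091\tpi-Fam168b.1",  # hypothetical repeat of the first line
]
deduped = unique_lines(output)
print(len(deduped))  # prints: 2
```

If order does not matter, `sorted(set(output))` gives the sorted, de-duplicated lines the question describes.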

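For reference, this kind of task can be done more elegantly with the Python library pandas. Dedicated tools also exist that are designed specifically for genomic interval files like the one you are using.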
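As a rough idea of what the pandas route could look like, the sketch below reads the records into a DataFrame, joins plus-strand rows to minus-strand rows on the chromosome, and filters by the start gap. The column names, the function name `close_opposite_pairs`, and the four-line sample are assumptions for this example.

```python
import io
import pandas as pd

# In-memory stand-in for inputfile.txt, including the artificially added line
SAMPLE = (
    "chr1    34870071    34899867    pi-Fam168b.1    -\n"
    "chr1    34870091    34899887    pi-Fam168b.1    +\n"
    "chr11   98724946    98764609    pi-Wipf2.1  +\n"
    "chr11   120486441   120495268   pi-Mafg.1   -\n"
)

def close_opposite_pairs(source, max_gap=200):
    """Return one row per (+, -) pair on the same chromosome with starts < max_gap apart."""
    cols = ["chr", "start", "stop", "gene_name", "strand"]
    df = pd.read_csv(source, sep=r"\s+", header=None, names=cols)
    plus = df[df["strand"] == "+"]
    minus = df[df["strand"] == "-"]
    # Join every plus record to every minus record on the same chromosome
    pairs = plus.merge(minus, on="chr", suffixes=("_plus", "_minus"))
    near = (pairs["start_plus"] - pairs["start_minus"]).abs() < max_gap
    return pairs[near].reset_index(drop=True)

result = close_opposite_pairs(io.StringIO(SAMPLE))
print(result[["chr", "start_plus", "gene_name_plus", "start_minus", "gene_name_minus"]])
```

The merge still builds all plus/minus combinations per chromosome, so memory use grows quickly on very large files; a file path can be passed in place of the `StringIO` object.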
