How to print rows if they fall within a specific range

Posted 2024-05-13 08:43:51

I have two large files that look like this:

f1:

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

f2:

chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025

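What I want to do: if the range given by the start and end columns of file2 lies within the range in file1 (columns 2 and 3) and the chr value matches, print the file2 row as extra columns on that file1 row. So, based on the dummy example files above, the desired output would be (only linc1320's range falls within a file1 row, namely the first one):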

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

I'm not an expert coder, but I have been using this code, changing the range by hand according to file2:

awk -F ',' '$2<=3073300 && $3>=3074300 {print $1,$2,$3,$4,$5,$6,$7}' f1.csv

I have no particular preference for a programming language; Python or awk would both be very helpful. Thanks for your help.


3 answers

EDIT: With the OP's edited input, you could try the following. It works even if file2 has more than 4 fields.

awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  chr[++count]=$1
  start[count]=$3
  end[count]=$4
  match($0,/,.*/)
  val[count]=substr($0,RSTART+1,RLENGTH-1)
  next
}
{
  for(i=1;i<=count;i++){
    if(chr[i]==$1 && start[i]>$2 && end[i]<$3){
      print $0 OFS val[i]
      next
    }
  }
}
1' file2 file1


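With your shown samples, could you please try the following. Written and tested in GNU awk; it should work in any awk. It takes reference from anubhava's answer. Note that this version reads start and end from columns 2 and 3 of file2.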

awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  start[++count]=$2
  end[count]=$3
  val[count]=$0
  next
}
{
  for(i=1;i<=count;i++){
    if(start[i]>$2 && end[i]<$3){
      print $0 OFS val[i]
      next
    }
  }
}
1' file2 file1

Explanation: a detailed explanation of the above.

awk '                                  ##Starting awk program from here.
BEGIN{                                 ##Starting BEGIN section of this program from here.
  FS=OFS=","                           ##Setting FS and OFS as comma here. 
}
FNR==NR{                               ##Checking condition which will be true when file2 is being read.
  start[++count]=$2                    ##Creating start array with count variable as index and has $2 value in it.
  end[count]=$3                        ##Creating end array with count as index and value is $3.
  val[count]=$0                        ##Creating val array with index of count and value as $0.
  next                                 ##next will skip all further statements from here.
}
{
  for(i=1;i<=count;i++){               ##Running for loop till value of count here.
    if(start[i]>$2 && end[i]<$3){      ##Checking condition if start[i]>$2 AND end[i]<$3.
      print $0 OFS val[i]              ##Then printing current line with OFS, val here.
      next                             ##next will skip all further statements from here.
    }
  }
}
1                                      ##1 will print current line here.
' file2 file1                          ##Mentioning Input_file names here.

You may use this awk:

awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

A more readable form:

awk '
BEGIN { FS = OFS = "," }
FNR == NR {
   if (FNR > 1) {
      chr[++n] = $1
      id[n] = $2
      r1[n] = $3
      r2[n] = $4
   }
   next
}
{
   for (i=1; i<=n; ++i)
      if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
         $0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
         break
      }
} 1' file2 file1
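
For readers who would rather stay in Python without pandas, the same logic can be sketched with plain string handling. This is only an illustration under assumptions: the function name annotate_lines is hypothetical, file2 is assumed to start with the header line chr,name,start,end, and the test is the same strict one used in the awk above (same chr, start greater than column 2, end less than column 3). Lines are split on commas by hand rather than with the csv module so that the quoted fields of file1 are written back unchanged.

```python
def annotate_lines(file1_lines, file2_lines):
    """Append name,start,end from file2 to the first file1 row that contains it.

    Hypothetical helper mirroring the awk logic above; both arguments are
    iterables of text lines (open file objects work as well as lists).
    """
    intervals = []
    file2_iter = iter(file2_lines)
    next(file2_iter, None)  # assumption: first line of file2 is the header
    for line in file2_iter:
        if not line.strip():
            continue  # ignore blank lines
        chrom, name, start, end = line.strip().split(",")[:4]
        intervals.append((chrom, name, int(start), int(end)))

    result = []
    for line in file1_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split(",")
        chrom, lo, hi = fields[0], int(fields[1]), int(fields[2])
        for c, name, start, end in intervals:
            # same chromosome and strictly inside the file1 range
            if c == chrom and start > lo and end < hi:
                line = f"{line},{name},{start},{end}"
                break  # keep only the first match, like the awk version
        result.append(line)
    return result


if __name__ == "__main__":
    # file names taken from the awk command line above
    with open("file2") as f2, open("file1") as f1:
        print("\n".join(annotate_lines(f1, f2)))
```

Like the awk loop, this scans every file2 interval for every file1 row, so for very large inputs it may be worth grouping the intervals by chromosome first.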

Let's try solving the problem the pandas way. First, read the csv files into pandas dataframes:

import pandas as pd

f1 = pd.read_csv('file1.csv', header=None)
f2 = pd.read_csv('file2.csv')

>>> f1

      0        1        2        3                     4              5                     6
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1


>>> f2

    chr      name    start      end
0  chr1  linc1320  3073300  3074300
1  chr3  linc2245  3077270  3078250
2  chr1  linc8956  4410501  4406025

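Now we can merge the two dataframes on the chromosome column and filter the rows that satisfy the interval-inclusion condition, then join the filtered rows back to f1: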

m = f1.reset_index()\
      .merge(f2, left_on=0, right_on='chr')\
      .where(lambda x: x[1].le(x['start']) & x[2].ge(x['end']))\
      .set_index('index')[['name', 'start', 'end']]

f3 = f1.join(m)

>>> f3

      0        1        2        3                     4              5                     6      name      start        end
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC  linc1320  3073300.0  3074300.0
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA       NaN        NaN        NaN
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1       NaN        NaN        NaN

PS: You can also save the resulting dataframe f3 to a csv file with f3.to_csv('file3.csv').
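
One detail that differs between the answers on this page: the pandas code compares with le and ge, so an interval that touches the boundary of the file1 range still counts as inside, whereas the awk answers use the strict > and <. On the sample data both give the same result. A minimal sketch of the difference, where the function contains and its inclusive flag are hypothetical names used only for illustration:

```python
def contains(outer_start, outer_end, inner_start, inner_end, inclusive=True):
    """Return True if the inner interval lies within the outer interval.

    inclusive=True mirrors the le/ge test in the pandas code;
    inclusive=False mirrors the strict > and < test in the awk answers.
    """
    if inclusive:
        return outer_start <= inner_start and inner_end <= outer_end
    return outer_start < inner_start and inner_end < outer_end


# linc1320 against the first file1 row: inside under both definitions
print(contains(3073253, 3074322, 3073300, 3074300))                   # True
print(contains(3073253, 3074322, 3073300, 3074300, inclusive=False))  # True

# an interval starting exactly at the file1 start: the definitions disagree
print(contains(3073253, 3074322, 3073253, 3074300))                   # True
print(contains(3073253, 3074322, 3073253, 3074300, inclusive=False))  # False
```

Which definition is appropriate depends on how the coordinates in the two files are meant to be interpreted.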
