How to print a line if it falls within a specific range

3 answers

User

Answer 1 · edited 2024-05-13 08:43:51

EDIT: With the OP's edited input, you can try the following. It also works if file2 has more than 4 fields.

awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  start[++count]=$3   # assumes the edited file2 layout: chr,name,start,end
  end[count]=$4
  match($0,/,.*/)
  val[count]=substr($0,RSTART+1,RLENGTH-1)   # everything after the first comma
  next
}
{
  for(i=1;i<=count;i++){
    if(start[i]>$2 && end[i]<$3){
      print $0 OFS val[i]
      next
    }
  }
}
1' file2 file1

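With the samples you have shown, could you please try the following. Written and tested in GNU awk; it should work in any awk. It takes its approach from anubhava's answer.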

awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  start[++count]=$2
  end[count]=$3
  val[count]=$0
  next
}
{
  for(i=1;i<=count;i++){
    if(start[i]>$2 && end[i]<$3){
      print $0 OFS val[i]
      next
    }
  }
}
1' file2 file1

Explanation: a detailed, line-by-line explanation of the code above.

awk '                                  ##Starting awk program from here.
BEGIN{                                 ##Starting BEGIN section of this program from here.
  FS=OFS=","                           ##Setting FS and OFS as comma here. 
}
FNR==NR{                               ##Checking condition which will be true when file2 is being read.
  start[++count]=$2                    ##Creating start array with count as index and $2 as its value.
  end[count]=$3                        ##Creating end array with count as index and value is $3.
  val[count]=$0                        ##Creating val array with index of count and value as $0.
  next                                 ##next will skip all further statements from here.
}
{
  for(i=1;i<=count;i++){               ##Running for loop till value of count here.
    if(start[i]>$2 && end[i]<$3){      ##Checking condition if start[i]>$2 AND end[i]<$3.
      print $0 OFS val[i]              ##Then printing current line with OFS, val here.
      next                             ##next will skip all further statements from here.
    }
  }
}
1                                      ##1 will print current line here.
' file2 file1                          ##Mentioning Input_file names here.
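For readers who find Python easier to follow, the two-pass logic explained above can be sketched roughly as below. This is an illustration only, not part of the original answer: the function name `annotate` and the inline sample rows are hypothetical, and file2 is assumed to have no header and to hold the range in fields 2 and 3, as the awk code does.

```python
def annotate(file1_lines, file2_lines, sep=","):
    """Append the first file2 line whose range lies strictly inside the file1 range."""
    # First pass (the FNR==NR block in awk): remember every range from file2.
    ranges = []
    for line in file2_lines:
        fields = line.split(sep)
        # Assumed layout: field 2 = start, field 3 = end (1-based, as in awk).
        ranges.append((int(fields[1]), int(fields[2]), line))

    # Second pass: test each file1 line against the stored ranges.
    out = []
    for line in file1_lines:
        fields = line.split(sep)
        lo, hi = int(fields[1]), int(fields[2])
        for start, end, val in ranges:
            if start > lo and end < hi:
                line = line + sep + val
                break  # stop at the first match, like `next` in awk
        out.append(line)  # unmatched lines pass through unchanged, like the trailing 1
    return out


# Hypothetical sample rows, shortened from the data shown in this thread.
file2 = ["linc1320,3073300,3074300"]
file1 = ["chr1,3073253,3074322,gene_id", "chr1,3102916,3103025,gene_id"]
result = annotate(file1, file2)
for row in result:
    print(row)
```

Like the awk version, the sketch compares every file1 line with every stored range, so the cost grows with the product of the two file sizes.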

User

Answer 2 · edited 2024-05-13 08:43:51

You may use this awk:

awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

A more readable form:

awk '
BEGIN { FS = OFS = "," }
FNR == NR {
   if (FNR > 1) {
      chr[++n] = $1
      id[n] = $2
      r1[n] = $3
      r2[n] = $4
   }
   next
}
{
   for (i=1; i<=n; ++i)
      if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
         $0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
         break
      }
} 1' file2 file1
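The chromosome check is what separates this answer from a plain range test. A rough Python sketch of the same idea follows. It is not the answerer's code: the helper names `load_ranges` and `annotate_by_chr` are hypothetical, file2 is assumed to be chr,name,start,end with a header line, and ranges are grouped by chromosome in a dictionary so that only same-chromosome candidates are scanned.

```python
from collections import defaultdict


def load_ranges(file2_lines, sep=","):
    """Group file2 rows (chr,name,start,end; first line is a header) by chromosome."""
    by_chr = defaultdict(list)
    for line in file2_lines[1:]:  # skip the header, like FNR>1 in awk
        chrom, name, start, end = line.split(sep)[:4]
        by_chr[chrom].append((name, int(start), int(end)))
    return by_chr


def annotate_by_chr(file1_lines, by_chr, sep=","):
    """Append name,start,end of the first same-chromosome range strictly inside the row's range."""
    out = []
    for line in file1_lines:
        fields = line.split(sep)
        chrom, lo, hi = fields[0], int(fields[1]), int(fields[2])
        # Only ranges on the same chromosome are candidates.
        for name, start, end in by_chr.get(chrom, []):
            if start > lo and end < hi:
                line = sep.join([line, name, str(start), str(end)])
                break
        out.append(line)
    return out


# Sample rows taken from the data shown in this thread.
file2 = [
    "chr,name,start,end",
    "chr1,linc1320,3073300,3074300",
    "chr3,linc2245,3077270,3078250",
    "chr1,linc8956,4410501,4406025",
]
file1 = [
    'chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"',
    'chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"',
]
result = annotate_by_chr(file1, load_ranges(file2))
for row in result:
    print(row)
```

The second sample row shows why the check matters: the linc2245 range fits numerically inside 3077253-3078322, but it sits on chr3, so the chr1 row is left unannotated.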

User

Answer 3 · edited 2024-05-13 08:43:51

Let's try solving this the pandas way. First, read the csv files into pandas dataframes:

import pandas as pd
f1 = pd.read_csv('file1.csv', header=None)
f2 = pd.read_csv('file2.csv')

>>> f1

      0        1        2        3                     4              5                     6
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1


>>> f2

    chr      name    start      end
0  chr1  linc1320  3073300  3074300
1  chr3  linc2245  3077270  3078250
2  chr1  linc8956  4410501  4406025

Now we can merge the two dataframes and filter the rows that satisfy the interval-containment condition, then join the filtered rows back onto f1:

m = f1.reset_index()\
      .merge(f2, left_on=0, right_on='chr')\
      .where(lambda x: x[1].le(x['start']) & x[2].ge(x['end']))\
      .set_index('index')[['name', 'start', 'end']]

f3 = f1.join(m)

>>> f3

      0        1        2        3                     4              5                     6      name      start        end
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC  linc1320  3073300.0  3074300.0
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA       NaN        NaN        NaN
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1       NaN        NaN        NaN

PS: You can also save the resulting dataframe f3 to a csv file with f3.to_csv('file3.csv').
