基于AWK创建反向补码序列

2024-04-20 01:22:45 发布

您现在位置:Python中文网/ 问答频道 /正文

亲爱的stackoverflow用户:

我有如下选项卡sep数据:

head -4 input.tsv

seq A      C change
seq T      A ok
seq C      C change
seq AC   CCT change

我需要在awk中创建反向补码函数,这样做

head -4 output.tsv

seq T      G change
seq T      A ok
seq G      G change
seq GT   AGG change

所以,如果第四列标记为“change”,我需要创建反向补码序列

提示-在bash-bash中执行相同的操作,例如tr此任务的一行代码是:

echo "ACCGA" | rev | tr "ATGC" "TACG"

我被试过这样的事

awk 'BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" }{OFS="\t"}
function revcomp(   i, o) {
    o = ""
    for(i = length; i > 0; i--)
        o = o c[substr($0, i, 1)]
    return(o)
}
{

if($4 == "change"){$2 = revcom(); $3 = revcom()} print $0; else print $0}' input

生物反向序列平均值:

A => T
C => G
G => C
T => A

和反向补码的意思是:

ACCATG => CATGGT

编辑:任何人都可以在python中分享这个解决方案


Tags: 用户bashinputtsvok序列changestackoverflow
3条回答

请您尝试以下内容,用显示的样本编写和测试(GNUawk

awk '
BEGIN{
  label["A"]="T"
  label["C"]="G"
  label["G"]="C"
  label["T"]="A"
}
function cVal(field){
  delete array
  num=split($field,array,"")
  for(k=1;k<=num;k++){
   if(array[k] in label){
     val=label[array[k]] val
   }
  }
  $field=val
  val=""
}
$NF=="change"{
  for(i=2;i<=(NF-1);i++){
    cVal(i)
  }
}
1
'  Input_file | column -t

解释:添加上述代码的详细解释

awk '                                ##Starting awk program from here.
BEGIN{                               ##Starting BEGIN section of this code here.
  label["A"]="T"                     ##Creating array label with index A and value T.
  label["C"]="G"                     ##Creating array label with index C and value G.
  label["G"]="C"                     ##Creating array label with index G and value C.
  label["T"]="A"                     ##Creating array label with index T and value A.
}
function cVal(field){                ##Creating function named cVal here with passing value field in it.
  delete array                       ##Deleting array here.
  num=split($field,array,"")         ##Splitting current field value passed to it and creating array.
  for(k=1;k<=num;k++){               ##Running for loop fromk=1 to till value of num here.
   if(array[k] in label){            ##Checking condition if array value with index k is present in label array then do following.
     val=label[array[k]] val         ##Creating val which has label value with index array with index k and keep concatenating its value to it.
   }
  }
  $field=val                         ##Setting current field value to val here.
  val=""                             ##Nullifying val here.
}
$NF=="change"{                       ##Checking condition if last field is change then do following.
  for(i=2;i<=(NF-1);i++){            ##Running for loop from 2nd field to 2nd last field.
    cVal(i)                          ##Calling function with passing current field number to it.
  }
}
1                                    ##1 will print current line here.
' Input_file | column -t             ##Mentioning Input_file name here.

只要对你的尝试稍加修改,你就可以做如下的事情

function revcomp(arg) {
    o = ""
    for(i = length(arg); i > 0; i--)
        o = o c[substr(arg, i, 1)]
    return(o)
}

BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" ; OFS="\t"}

{
    if($4 == "change") {
        $2 = revcomp($2); 
        $3 = revcomp($3)
    } 
}1

这里的关键是使用函数revcomp将参数作为列值,并通过从末尾迭代对其进行操作。您以前在整行$0上执行过,即substr($0, i, 1),这将导致对数组c进行大量异常查找

我还随意更改了函数revcomp的原型,以获取输入字符串并返回相反的字符串。因为我不确定你在最初的尝试中打算如何使用

如果您打算在更大的脚本的一部分中使用上述内容,我建议将上述整个代码放在脚本文件中,将she-bang解释器设置为#!/usr/bin/awk -f,并将脚本作为awk -f script.awk input.tsv运行

awk中实现的原始bash版本如下所示。请注意,它不干净,也不是推荐的方法。详见AllAboutGetline

与前面一样,将函数调用为$2 = revcomp_bash($2)$3 = revcomp_bash($3)

function revcomp_bash(arg) {
    o = ""
    cmd = "printf \"%s\" " arg "| rev | tr \"ATGC\" \"TACG\""
    while ( ( cmd | getline o ) > 0 ) { 

    }
    close(cmd);
    return(o)
}

您的整个代码都讲GNUawk-ism,所以不想将其转换为与POSIX兼容的代码。您可以将split()与空的反限制器一起使用,而不是length(),但是POSIX规范高兴地说,“空字符串作为fs的值的效果是未指定的。”

对于这个特定的应用程序来说效率有点低,因为它在每次调用tr()时都会创建映射数组,并在tr()中执行相同的循环,然后再在rev()中执行相同的循环,但我想我会展示如何编写独立的tr()rev()函数,而且它可能足够快,满足您的需要:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
$4 == "change" {
    for ( i=2; i<=3; i++) {
        $i = rev(tr($i,"ACGT","TGCA"))
    }
}
{ print }

function tr(instr,old,new,      outstr,pos,map) {
    for (pos=1; pos<=length(old); pos++) {
        map[substr(old,pos,1)] = substr(new,pos,1)
    }
    for (pos=1; pos<=length(instr); pos++) {
        outstr = outstr map[substr(instr,pos,1)]
    }
    return outstr
}

function rev(instr,     outstr,pos) {
    for (pos=1; pos<=length(instr); pos++) {
        outstr = substr(instr,pos,1) outstr
    }
    return outstr
}

$ awk -f tst.awk file
seq     T       G       change
seq     T       A       ok
seq     G       G       change
seq     GT      AGG     change

相关问题 更多 >