<p>I have genetics data like this:</p>
<pre><code>MUT1 G_->_A_(het) 44%_(96)___[45%_(49)_/_43%_(47)] rs1799967_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.4956G>A 1
MUT1 A_->_G_(homo) 99%_(297)___[99%_(151)_/_99%_(146)] rs206075_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.4563A>G 1
MUT1 G_->_C_(homo) 100%_(259)___[100%_(132)_/_100%_(127)] COSM4147689_(COSMIC),_COSM4147690_(COSMIC),_rs206076_(Gene_file;_1000Genomes;_ClinVar;_ClinVarVCF;_dbSNP) c.6513G>C 2
MUT1 A_->_C_(het) 41%_(103)___[42%_(53)_/_40%_(50)] COSM3753646_(COSMIC),_COSM147663_(COSMIC),_rs144848_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.1114A>C 5
</code></pre>
<p>I need to parse this data and extract only some of the fields.</p>
<p>The desired output is:</p>
^{pr2}$
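<p>So the output should contain: the <strong>entire first column</strong>; from the second column only <strong>het or homo</strong>; from the third column only the <strong>%</strong>; from the fifth column only the <strong>rs number</strong>, which is always at a different position; and the last columns.</p>
<p>Note: I know that the het/homo information is always in the last field of the second column, and that the % is always in the first field of the third column.</p>
<p>My solution is:</p>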
<pre><code>awk -v OFS="\t" '{print $1,$5,$6,$9,$10,$11}' zkouska.csv | awk -v OFS="\t" 'NR>1{split($2,arr2,"_"); split($3,arr3,"_"); print $1,arr2[4],arr3[1],$4,$5,$6}'
</code></pre>
<p>But the output is:</p>
<pre><code>BRCA1 (het) 44% rs1799967_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.4956G>A 1
BRCA1 (homo) 99% rs206075_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.4563A>G 1
BRCA1 (homo) 100% COSM4147689_(COSMIC),_COSM4147690_(COSMIC),_rs206076_(Gene_file;_1000Genomes;_ClinVar;_ClinVarVCF;_dbSNP) c.6513G>C 2
BRCA1 (het) 41% COSM3753646_(COSMIC),_COSM147663_(COSMIC),_rs144848_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.1114A>C 5
BRCA1 (homo) 100% COSM148277_(COSMIC),_COSM3755561_(COSMIC),_rs16942_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.3548A>G 5
</code></pre>
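<p>There is still a problem with extracting the <strong>rs</strong> number from the fifth column, and with removing the parentheses in the second field. Input and output should be tab-separated. <strong>The solution does not have to be awk only.</strong></p>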
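<p>One possible way to cover the two remaining steps is sketched below. It is only a sketch under assumptions: the function name <code>extract_fields</code> is hypothetical, the input is assumed to have the six-column layout of the sample at the top (so the annotation field holding the rs identifier is the fourth whitespace-separated field there), and no header line is skipped. It uses <code>match()</code> to find the first <code>rs</code> followed by digits wherever it sits in that field, and <code>gsub()</code> to strip the parentheses around het/homo.</p>

```shell
# extract_fields: hypothetical helper name, not part of the original attempt.
# Assumption: stdin holds rows shaped like the six-column sample in the question;
# fields contain no spaces, so default whitespace splitting handles tabs as well.
extract_fields() {
  awk 'BEGIN { OFS = "\t" }
  {
    # Zygosity: the last "_"-separated piece of field 2, e.g. "(het)".
    n = split($2, parts, "_")
    zyg = parts[n]
    gsub(/[()]/, "", zyg)            # drop the surrounding parentheses

    # Percentage: the first "_"-separated piece of field 3, e.g. "44%".
    split($3, pct, "_")

    # rs identifier: first "rs" + digits found anywhere in field 4.
    rs = ""
    if (match($4, /rs[0-9]+/))
      rs = substr($4, RSTART, RLENGTH)

    print $1, zyg, pct[1], rs, $5, $6
  }'
}
```

<p>Under these assumptions the function could take the place of the second awk in the pipeline above, reading the output of the first awk on stdin. For the first sample row it should print the tab-separated values MUT1, het, 44%, rs1799967, c.4956G&gt;A and 1. If the file starts with a header line, that line would need to be skipped separately.</p>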