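<p>To create the new data frames, you can use pandas conditional (boolean) indexing. (Note that your question contains a mistake in the expected <code>NORMAL</code> data frame: in the row at index 5, <code>SampleType</code> should be <code>Normal</code>.)</p>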
<pre><code>NORMAL = df[df['SampleType']=='Normal'].copy()
TUMOR = df[df['SampleType']=='Tumor'].copy()
</code></pre>
<p>Or, if <code>SampleType</code> may contain values other than <code>'Normal'</code> and <code>'Tumor'</code> and you want everything that is not <code>'Normal'</code>:</p>
^{pr2}$
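<p>One way to express that selection, offered as a sketch rather than as the exact snippet from this answer, is to negate the comparison. The sample frame and the <code>NOT_NORMAL</code> name below are made up for illustration:</p>
<pre><code>import pandas as pd

# Made-up sample data for illustration; not the data from the question.
df = pd.DataFrame({
    'ID': ['TCGA-AB-0001', 'TCGA-AB-0001', 'TCGA-AB-0002', 'TCGA-AB-0003'],
    'SampleType': ['Tumor', 'Normal', 'Metastatic', 'Normal'],
})

# Negating the comparison keeps every row whose SampleType is not 'Normal',
# including labels other than 'Tumor'.
NOT_NORMAL = df[df['SampleType'] != 'Normal'].copy()
print(NOT_NORMAL)
</code></pre>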
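<p>Then, to drop the duplicates while keeping a specific value, you can create another column that holds the same information as integers, which are easier to sort than a list of strings:</p>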
<pre><code>NORMAL['Whatever'] = 0
TUMOR['Whatever'] = 0
</code></pre>
<p>Of course, you can do this before splitting <code>df</code>, so that it is done on one data frame instead of two. Fill in this column:</p>
<pre><code># .ix was removed in pandas 1.0; .loc does the same label-based assignment.
NORMAL.loc[NORMAL['Reference'] == 'HG19', 'Whatever'] = 1
TUMOR.loc[TUMOR['Reference'] == 'HG19', 'Whatever'] = 1
NORMAL.loc[NORMAL['Reference'] == 'HG18', 'Whatever'] = 2
TUMOR.loc[TUMOR['Reference'] == 'HG18', 'Whatever'] = 2
</code></pre>
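<p>If you fill the column before splitting, a dictionary lookup with <code>Series.map</code> can replace the repeated assignments. This is only a sketch: the sample frame and the <code>priority</code> dictionary are assumptions made for illustration, and references missing from the dictionary fall back to 0.</p>
<pre><code>import pandas as pd

# Made-up sample data for illustration; not the data from the question.
df = pd.DataFrame({
    'ID': ['TCGA-AB-0001', 'TCGA-AB-0001', 'TCGA-AB-0002'],
    'Reference': ['HG19', 'GRCh37', 'HG18'],
    'SampleType': ['Tumor', 'Tumor', 'Normal'],
})

# Hypothetical ranking: a lower number means the row is preferred when deduplicating.
priority = {'HG19': 1, 'HG18': 2}

# References missing from the dict (here GRCh37) map to NaN, then fall back to 0.
df['Whatever'] = df['Reference'].map(priority).fillna(0).astype(int)
print(df)
</code></pre>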
<p>Then sort by this new column and drop the duplicates, keeping only the first row for each <code>ID</code>:</p>
<pre><code>NORMAL.sort_values(by = 'Whatever', inplace = True)
NORMAL.drop_duplicates(subset = 'ID',inplace = True)
TUMOR.sort_values(by = 'Whatever', inplace = True)
TUMOR.drop_duplicates(subset = 'ID',inplace = True)
</code></pre>
<p>To get the expected output, drop the temporary column and then sort by index:</p>
<pre><code># Passing the axis positionally (drop('Whatever', 1)) no longer works in pandas 2.x.
NORMAL.drop(columns='Whatever', inplace=True)
NORMAL.sort_index(inplace=True)
TUMOR.drop(columns='Whatever', inplace=True)
TUMOR.sort_index(inplace=True)
</code></pre>
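<p>The steps above can also be chained into one helper. This is a minimal sketch under stated assumptions: <code>keep_preferred</code>, its <code>priority</code> argument and the sample frame are hypothetical names and data, and a stable sort (<code>kind='mergesort'</code>) is used so that rows with equal rank keep their original order.</p>
<pre><code>import pandas as pd

def keep_preferred(frame, priority):
    """Keep one row per ID, preferring the reference with the lowest rank."""
    # References missing from the priority dict get rank 0 (most preferred).
    rank = frame['Reference'].map(priority).fillna(0)
    return (frame.assign(Whatever=rank)
                 .sort_values('Whatever', kind='mergesort')
                 .drop_duplicates(subset='ID')
                 .drop(columns='Whatever')
                 .sort_index())

# Made-up sample data for illustration; not the data from the question.
df = pd.DataFrame({
    'ID': ['TCGA-AB-0001', 'TCGA-AB-0001', 'TCGA-AB-0002', 'TCGA-AB-0002'],
    'Reference': ['HG19', 'HG18', 'HG18', 'GRCh37'],
    'SampleType': ['Tumor', 'Tumor', 'Tumor', 'Tumor'],
})
result = keep_preferred(df, {'HG19': 1, 'HG18': 2})
print(result)
</code></pre>
<p>Because the helper returns a new frame instead of modifying its input, it can be applied to <code>NORMAL</code> and <code>TUMOR</code> separately without the <code>inplace</code> calls.</p>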
<p>Output:</p>
<pre><code>Out[3]: NORMAL
             ID  Reference  SampleType
3  TCGA-AB-0001     GRCh37      Normal
6  TCGA-AB-0002     GRCh37      Normal
Out[32]: TUMOR
             ID  Reference  SampleType
0  TCGA-AB-0001       HG19       Tumor
8  TCGA-AB-0003     GRCh37       Tumor
9  TCGA-AB-0002     GRCh37       Tumor
</code></pre>