Pandas DataFrame: group by multiple columns and drop duplicate rows


1 answer

Anonymous, 1 day ago

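To create the new DataFrames, you can use pandas boolean indexing. (In your question there is a mistake in the <code>NORMAL</code> DataFrame at index 5: <code>SampleType</code> should be <code>Normal</code>.)

<pre><code># Split df by sample type; .copy() gives independent frames that are safe to modify
NORMAL = df[df['SampleType']=='Normal'].copy()
TUMOR = df[df['SampleType']=='Tumor'].copy()
</code></pre>

Alternatively, if <code>SampleType</code> may contain values other than <code>'Normal'</code> and <code>'Tumor'</code>, you can select everything that is not <code>'Normal'</code> instead of matching <code>'Tumor'</code> exactly.

Next, to drop duplicates while keeping a specific value, create another column that carries the same information as <code>Reference</code> but as integers, which are easier to sort than a list of strings:

<pre><code># Default priority 0 for every row
NORMAL['Whatever'] = 0
TUMOR['Whatever'] = 0
</code></pre>

Of course, you can do this before splitting <code>df</code>, so that you only do it on one DataFrame instead of two. Fill in the column:

<pre><code># .loc replaces .ix, which was removed in pandas 1.0
NORMAL.loc[NORMAL['Reference'] == 'HG19', 'Whatever'] = 1
TUMOR.loc[TUMOR['Reference'] == 'HG19', 'Whatever'] = 1
NORMAL.loc[NORMAL['Reference'] == 'HG18', 'Whatever'] = 2
TUMOR.loc[TUMOR['Reference'] == 'HG18', 'Whatever'] = 2
</code></pre>

Then sort by the new column and drop duplicates, keeping only the first row for each <code>ID</code>. Rows left at 0 (such as <code>GRCh37</code>) sort first, so they are preferred over <code>HG19</code>, which in turn is preferred over <code>HG18</code>:

<pre><code>NORMAL.sort_values(by='Whatever', inplace=True)
NORMAL.drop_duplicates(subset='ID', inplace=True)
TUMOR.sort_values(by='Whatever', inplace=True)
TUMOR.drop_duplicates(subset='ID', inplace=True)
</code></pre>

To get the expected output, drop the temporary column and then sort by index:

<pre><code># columns= replaces the positional axis argument, which pandas 2.0 no longer accepts
NORMAL.drop(columns='Whatever', inplace=True)
NORMAL.sort_index(inplace=True)
TUMOR.drop(columns='Whatever', inplace=True)
TUMOR.sort_index(inplace=True)
</code></pre>

Output:

<pre><code>Out[3]: NORMAL
             ID Reference SampleType
3  TCGA-AB-0001    GRCh37     Normal
6  TCGA-AB-0002    GRCh37     Normal

Out[32]: TUMOR
             ID Reference SampleType
0  TCGA-AB-0001      HG19      Tumor
8  TCGA-AB-0003    GRCh37      Tumor
9  TCGA-AB-0002    GRCh37      Tumor
</code></pre>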
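The answer notes that the priority column can be filled in before splitting, so the work happens once. A minimal sketch of that variant follows. The sample rows are hypothetical (the question's table is not reproduced here), and the `rank` dict, the `deduped` name, and the `!=` split for `TUMOR` are illustrative assumptions rather than part of the original answer.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({
    "ID": ["TCGA-AB-0001", "TCGA-AB-0001", "TCGA-AB-0001",
           "TCGA-AB-0002", "TCGA-AB-0002"],
    "Reference": ["HG19", "HG18", "GRCh37", "HG18", "GRCh37"],
    "SampleType": ["Tumor", "Tumor", "Normal", "Normal", "Normal"],
})

# Lower rank wins. References not listed here (e.g. GRCh37) get rank 0,
# mirroring the default value used in the answer.
rank = {"HG19": 1, "HG18": 2}

deduped = (
    df.assign(Whatever=df["Reference"].map(rank).fillna(0).astype(int))
      .sort_values("Whatever", kind="stable")          # best reference first
      .drop_duplicates(subset=["SampleType", "ID"])    # keep one row per type and ID
      .drop(columns="Whatever")                        # remove the temporary column
      .sort_index()                                    # restore the original row order
)

# Split once, after de-duplication. Using != 'Normal' (an assumption) also
# catches any sample type other than 'Tumor'.
NORMAL = deduped[deduped["SampleType"] == "Normal"].copy()
TUMOR = deduped[deduped["SampleType"] != "Normal"].copy()
```

De-duplicating on both `SampleType` and `ID` is what lets the single pass stand in for the two per-frame `drop_duplicates(subset='ID')` calls.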
