<p><code>pandas</code> provides a very simple way to do this: <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html" rel="nofollow noreferrer">pandas.DataFrame.drop_duplicates</a>.</p>
<p>Suppose the following file (<code>data.csv</code>) is stored in the current working directory:</p>
<pre><code>name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
John Doe,25,50000
Louise Jones,25,50000
</code></pre>
<p>The following script removes the duplicate records and writes the processed data to a CSV file (<code>processed_data.csv</code>) in the current working directory:</p>
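<pre class="lang-py prettyprint-override"><code>import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv("data.csv")
# Drop every row that exactly matches an earlier row in all columns
df = df.drop_duplicates()
# Write the result without the DataFrame index column
df.to_csv("processed_data.csv", index=False)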
</code></pre>
<p>The resulting output for this example is:</p>
<pre><code>name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
Louise Jones,25,50000
</code></pre>
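<p>If you want to see how many rows would be removed before dropping them, <code>DataFrame.duplicated()</code> returns a boolean mask that marks the repeated rows. The following is a minimal sketch, assuming the same sample data is loaded from an in-memory string instead of <code>data.csv</code>; the names <code>CSV_TEXT</code>, <code>dup_mask</code>, <code>n_duplicates</code> and <code>deduped</code> are illustrative and not part of the original script.</p>

```python
import io

import pandas as pd

# Same sample data as data.csv, inlined here (an assumption for this sketch)
# so the example runs without any file on disk.
CSV_TEXT = """name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
John Doe,25,50000
Louise Jones,25,50000
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# True for each row that repeats an earlier row in every column.
dup_mask = df.duplicated()
n_duplicates = int(dup_mask.sum())

# Drop the rows flagged above; the first occurrence of each row is kept.
deduped = df.drop_duplicates()

print(f"duplicate rows found: {n_duplicates}")
print(deduped.to_csv(index=False))
```

<p>Checking the mask first can be useful on larger files, where you may want to inspect <code>df[dup_mask]</code> before discarding anything.</p>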
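<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html" rel="nofollow noreferrer">pandas.DataFrame.drop_duplicates</a> can also treat rows as duplicates based on specific columns only (rather than requiring the entire row to match). Use the <code>subset</code> parameter to specify the column names.</p>
<p>For example,</p>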
<pre class="lang-py prettyprint-override"><code>import pandas as pd
df = pd.read_csv("data.csv")
df = df.drop_duplicates(subset=["age"])
df.to_csv("processed_data.csv", index=False)
</code></pre>
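<p>will remove every row whose <code>age</code> value has already appeared in an earlier row, keeping only the first record for each distinct <code>age</code>.</p>
<p>In this case, the output is:</p>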
<pre><code>name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
</code></pre>
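<p>Which record survives is controlled by the <code>keep</code> parameter of <code>drop_duplicates</code>, which defaults to <code>"first"</code>. The sketch below compares the three documented settings on the same sample data. It assumes the data is inlined as a string, and the variable names (<code>CSV_TEXT</code>, <code>keep_first</code>, <code>keep_last</code>, <code>keep_none</code>) are illustrative only.</p>

```python
import io

import pandas as pd

# Same sample data as data.csv, inlined here (an assumption for this sketch).
CSV_TEXT = """name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
John Doe,25,50000
Louise Jones,25,50000
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Default behaviour: keep the first row seen for each age.
keep_first = df.drop_duplicates(subset=["age"], keep="first")

# Keep the last row seen for each age instead.
keep_last = df.drop_duplicates(subset=["age"], keep="last")

# Drop every row whose age occurs more than once.
keep_none = df.drop_duplicates(subset=["age"], keep=False)

print(keep_first["name"].tolist())
print(keep_last["name"].tolist())
print(keep_none["name"].tolist())
```

<p>With <code>keep="last"</code>, the surviving row for age 25 is Louise Jones rather than John Doe, and with <code>keep=False</code> no row with age 25 remains.</p>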