How do I remove duplicate records from a DataFrame?

Posted 2024-04-26 06:29:56


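I have a Python script that extracts data from a table in a Word file and converts it into a DataFrame of Arabic text. The problem is that when I display the DataFrame, every record appears twice, and I cannot remove the duplicates.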

Code:

import pandas as pd
import docx

document = docx.Document(path)
table = document.tables[0]

data = []

for row_index, row in enumerate(table.rows): # Loop through rows
    data.append([]) # Add container list for each row.
    for col_index in range(13): # Loop through columns 
        cell_text= row.cells[col_index].paragraphs[0].text.encode('utf-8')
        cell_decode_text = cell_text.decode('utf-8')
        data[row_index].append(cell_decode_text)

df = pd.DataFrame(data)
df.columns=["group","person","category","source","dds","time","date","location","text","title","date_export","num_export",""]
df.drop_duplicates()
df.head(20)

Result:

 'date_export': {0: 'تاريخ الصادر',
  1: '',
  2: '2020/8/23',
  3: '2020/8/23',
  4: '2020/8/23',
  5: '2020/8/23',
  6: '2020/8/23',
  7: '2020/8/23',
  8: '2020/8/23',
  9: '2020/8/23',
  10: '2020/8/23',
  11: '2020/8/23',
  12: '2020/8/23'},
 'num_export': {0: 'رقم الصادر',
  1: 'رقم الصادر',
  2: '36015',
  3: '36015',
  4: '36016',
  5: '36016',
  6: '36017',
  7: '36017',
  8: '36018',
  9: '36018',
  10: '36019',
  11: '36019',
  12: '36020'},

2 Answers

Using the dataset you provided, the example below shows how df.drop_duplicates(inplace=True) accomplishes this, as @Chinte also mentions in their answer.

Before:

>>> df

    date_export     num_export
0   تاريخ الصادر    رقم الصادر
1       رقم الصادر
2   2020/8/23   36015
3   2020/8/23   36015
4   2020/8/23   36016
5   2020/8/23   36016
6   2020/8/23   36017
7   2020/8/23   36017
8   2020/8/23   36018
9   2020/8/23   36018
10  2020/8/23   36019
11  2020/8/23   36019
12  2020/8/23   36020

After:

>>> df.drop_duplicates(inplace=True)
>>> df

    date_export     num_export
0   تاريخ الصادر    رقم الصادر
1       رقم الصادر
2   2020/8/23   36015
4   2020/8/23   36016
6   2020/8/23   36017
8   2020/8/23   36018
10  2020/8/23   36019
12  2020/8/23   36020
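
The question's code calls df.drop_duplicates() without using its return value. A minimal sketch of why that has no visible effect, assuming a small two-column frame built from a few of the rows shown in the question (the variable name result is only for illustration):

```python
import pandas as pd

# Sample rows copied from the question's output (two columns only).
df = pd.DataFrame({
    "date_export": ["2020/8/23"] * 5,
    "num_export": ["36015", "36015", "36016", "36016", "36017"],
})

# Without inplace=True the deduplicated frame is returned
# and df itself is left unchanged.
result = df.drop_duplicates()
print(len(df))      # 5
print(len(result))  # 3

# Reassigning the result has the same effect as inplace=True.
df = df.drop_duplicates()
print(df.index.tolist())  # [0, 2, 4]
```

Either form works; reassigning tends to be easier to chain with other calls.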

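You have to apply it in place. By default, drop_duplicates() returns a new DataFrame and leaves df unchanged: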

df.drop_duplicates(inplace=True)
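
After deduplication the surviving rows keep their original index labels (0, 1, 2, 4, ... in the "After" output above). If consecutive labels are preferred, one option is the ignore_index parameter, available in pandas 1.0 and later. A small sketch with made-up sample rows in the same shape as the question's data (the name deduped is hypothetical):

```python
import pandas as pd

# Illustrative rows in the same shape as the question's data.
df = pd.DataFrame({
    "date_export": ["2020/8/23"] * 4,
    "num_export": ["36015", "36015", "36016", "36016"],
})

# ignore_index=True relabels the surviving rows 0..n-1 (pandas 1.0+).
deduped = df.drop_duplicates(ignore_index=True)
print(deduped.index.tolist())          # [0, 1]
print(deduped["num_export"].tolist())  # ['36015', '36016']
```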
