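I have a Python script that extracts data from a table in a Word file containing Arabic text and converts it to a DataFrame. The problem is that when I display the DataFrame, every record appears twice, and I cannot remove the duplicate records.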
import pandas as pd
import docx
document = docx.Document(path)
table = document.tables[0]
data = []
for row_index, row in enumerate(table.rows): # Loop through rows
    data.append([]) # Add container list for each row.
    for col_index in range(13): # Loop through columns
        cell_text = row.cells[col_index].paragraphs[0].text.encode('utf-8')
        cell_decode_text = cell_text.decode('utf-8')
        data[row_index].append(cell_decode_text)
df = pd.DataFrame(data)
df.columns=["group","person","category","source","dds","time","date","location","text","title","date_export","num_export",""]
df.drop_duplicates()
df.head(20)
'date_export': {0: 'تاريخ الصادر',
1: '',
2: '2020/8/23',
3: '2020/8/23',
4: '2020/8/23',
5: '2020/8/23',
6: '2020/8/23',
7: '2020/8/23',
8: '2020/8/23',
9: '2020/8/23',
10: '2020/8/23',
11: '2020/8/23',
12: '2020/8/23'},
'num_export': {0: 'رقم الصادر',
1: 'رقم الصادر',
2: '36015',
3: '36015',
4: '36016',
5: '36016',
6: '36017',
7: '36017',
8: '36018',
9: '36018',
10: '36019',
11: '36019',
12: '36020'},
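With the dataset you provided, the task can be done with
df.drop_duplicates(inplace=True)
as @Chinte also mentioned in their answer. By default, drop_duplicates() returns a new DataFrame and leaves df unchanged, so the result is lost unless you assign it or pass inplace=True.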
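To illustrate the before and after effect, here is a minimal sketch. It assumes a small hypothetical sample built from a few of the date_export and num_export values shown in the question, not the full 13-column table.

```python
import pandas as pd

# Hypothetical sample: a few values copied from the question's output.
df = pd.DataFrame({
    "date_export": ["2020/8/23"] * 5,
    "num_export": ["36015", "36015", "36016", "36016", "36017"],
})

# Without inplace=True (and without assignment) the result is discarded,
# so df still contains the duplicated rows.
df.drop_duplicates()
rows_before = len(df)

# With inplace=True the duplicates are removed from df itself.
df.drop_duplicates(inplace=True)
rows_after = len(df)

print(rows_before, rows_after)
```

The first call leaves the row count unchanged, while the second one keeps only the first occurrence of each duplicated row.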
You have to set inplace=True:
df.drop_duplicates(inplace=True)
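An equivalent alternative is to assign the returned DataFrame instead of modifying it in place. The sketch below assumes a hypothetical one-column sample, and the name deduped is only illustrative. It also resets the index so the remaining rows are numbered consecutively.

```python
import pandas as pd

# Hypothetical sample with each record repeated twice.
df = pd.DataFrame({"num_export": ["36015", "36015", "36016", "36016"]})

# drop_duplicates() returns a new DataFrame; keep it by assigning the result.
# reset_index(drop=True) renumbers the rows 0..n-1 after the duplicates are gone.
deduped = df.drop_duplicates().reset_index(drop=True)

print(deduped)
```

The original df is left untouched here, which can be convenient if the raw extracted table is still needed later.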