How to clean this data

Posted 2024-05-23 07:19:19


From this:


+------+------+--------------------------+-----------------+
| code | type |           name           | final_component |
+------+------+--------------------------+-----------------+
| C001 | ACT  | Exhaust Blower Drive     |                 |
| C001 | AL   |                          |                 |
| C001 | AL   |                          |                 |
| C001 | SET  | Exhaust Blower Drive     |                 |
| C001 | AL   |                          |                 |
| C001 | AL   |                          |                 |
| C001 | AL   |                          |                 |
| C002 | ACT  | Spray Pump Motor 1 Pump  |                 |
| C002 | SET  | Spray Pump Motor 1 Pump  |                 |
| C003 | ACT  | Spray Pump Motor 2 Pump  |                 |
| C003 | SET  | Spray Pump Motor 2 Pump  |                 |
| C004 | ACT  | Spray Pump Motor 3 Pump  |                 |
| C004 | SET  | Spray Pump Motor 3 Pump  |                 |
+------+------+--------------------------+-----------------+


Expected:

+------+------+--------------------------+--------------------------+
| code | type |           name           |     final_component      |
+------+------+--------------------------+--------------------------+
| C001 | ACT  | Exhaust Blower Drive     | Exhaust Blower Drive     |
| C001 | AL   |                          | Exhaust Blower Drive     |
| C001 | AL   |                          | Exhaust Blower Drive     |
| C001 | SET  | Exhaust Blower Drive     | Exhaust Blower Drive     |
| C001 | AL   |                          | Exhaust Blower Drive     |
| C001 | AL   |                          | Exhaust Blower Drive     |
| C001 | AL   |                          | Exhaust Blower Drive     |
| C002 | ACT  | Spray Pump Motor 1 Pump  | Spray Pump Motor 1 Pump  |
| C002 | SET  | Spray Pump Motor 1 Pump  | Spray Pump Motor 1 Pump  |
| C003 | ACT  | Spray Pump Motor 2 Pump  | Spray Pump Motor 2 Pump  |
| C003 | SET  | Spray Pump Motor 2 Pump  | Spray Pump Motor 2 Pump  |
| C004 | ACT  | Spray Pump Motor 3 Pump  | Spray Pump Motor 3 Pump  |
| C004 | SET  | Spray Pump Motor 3 Pump  | Spray Pump Motor 3 Pump  |
+------+------+--------------------------+--------------------------+

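For all rows that share the same code, I need to copy the name value of the row whose type is "SET" into final_component. For example, for C001 the name of the "SET" row is Exhaust Blower Drive, and I need to copy it into final_component for every C001 row.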

for ind in dataframe.index:         
    if dataframe['final_component'][ind]!=None:
        temp = dataframe['final_component'][ind]
        temp_code = dataframe['code'][ind]
    i = ind
    while dataframe['code'][i] == temp_code:
        dataframe['final_component'][ind] = temp
        i+=1

This is what I came up with, but it gets stuck in the while loop.


Tags: dataframe, code, drive, act, component, final, al, set
2 Answers

Here is one approach. First, recreate the DataFrame:

from io import StringIO
import pandas as pd

data = '''| code | type |           name           | final_component |
| C001 | ACT  | Exhaust Blower Drive     |                 |
| C001 | AL   |                          |                 |
| C001 | AL   |                          |                 |
| C001 | SET  | Exhaust Blower Drive     |                 |
| C001 | AL   |                          |                 |
| C001 | AL   |                          |                 |
| C001 | AL   |                          |                 |
| C002 | ACT  | Spray Pump Motor 1 Pump  |                 |
| C002 | SET  | Spray Pump Motor 1 Pump  |                 |
| C003 | ACT  | Spray Pump Motor 2 Pump  |                 |
| C003 | SET  | Spray Pump Motor 2 Pump  |                 |
| C004 | ACT  | Spray Pump Motor 3 Pump  |                 |
| C004 | SET  | Spray Pump Motor 3 Pump  |                 |
'''
df = pd.read_csv(StringIO(data), sep='|',)
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 5'])

Now strip the leading and trailing whitespace from the column names and the string columns:

# remove leading / trailing spaces
df.columns = [c.strip() for c in df.columns]
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.strip()

Then populate final_component from name:

# populate 'final component'
df['final_component'] = df['name']

Now replace the empty strings with None and use ffill():

# find final component that is empty string...
mask = df['final_component'] == ''

# ... and convert to None...
df.loc[mask, 'final_component'] = None

# ...so we can use ffill()
df['final_component'] = df['final_component'].ffill()
print(df)

    code type                     name          final_component
0   C001  ACT     Exhaust Blower Drive     Exhaust Blower Drive
1   C001   AL                              Exhaust Blower Drive
2   C001   AL                              Exhaust Blower Drive
3   C001  SET     Exhaust Blower Drive     Exhaust Blower Drive
4   C001   AL                              Exhaust Blower Drive
5   C001   AL                              Exhaust Blower Drive
6   C001   AL                              Exhaust Blower Drive
7   C002  ACT  Spray Pump Motor 1 Pump  Spray Pump Motor 1 Pump
8   C002  SET  Spray Pump Motor 1 Pump  Spray Pump Motor 1 Pump
9   C003  ACT  Spray Pump Motor 2 Pump  Spray Pump Motor 2 Pump
10  C003  SET  Spray Pump Motor 2 Pump  Spray Pump Motor 2 Pump
11  C004  ACT  Spray Pump Motor 3 Pump  Spray Pump Motor 3 Pump
12  C004  SET  Spray Pump Motor 3 Pump  Spray Pump Motor 3 Pump
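
The mask-then-ffill steps above can be condensed into a single expression. The following is only a sketch, assuming the name column has already been stripped so that blank cells are empty strings; the small DataFrame is made up for illustration and is not the full table from the question.

```python
import pandas as pd

# Hypothetical, shortened version of the question's data (blank names are '').
df = pd.DataFrame({
    'code': ['C001', 'C001', 'C001', 'C002'],
    'type': ['ACT', 'AL', 'SET', 'ACT'],
    'name': ['Exhaust Blower Drive', '', 'Exhaust Blower Drive',
             'Spray Pump Motor 1 Pump'],
})

# Series.mask replaces values with NaN wherever the condition is True,
# so the empty strings become missing values that ffill() can fill.
df['final_component'] = df['name'].mask(df['name'] == '').ffill()
print(df)
```

Like the step-by-step version, this fills purely by row order and does not look at code.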

Solution 1: when the data is already grouped in order

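If the 'name' column already contains null values, something as simple as ffill() will do. ffill() fills in missing values in a DataFrame: "ffill" stands for "forward fill", and it propagates the last valid observation forward. Note that used this way it does not take the values in code into account. If you need that as well, see Solution 2.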

import pandas as pd
import numpy as np

a = {'code':['C001']*7+['C002']*2+['C003']*2+['C004']*2,
     'typ':['ACT','AL','AL','SET','AL','AL','AL','ACT','SET','ACT','SET','ACT','SET'],
     'name':['Exhaust Blower Drive',None,None,'Exhaust Blower Drive',np.nan,np.nan,np.nan,
             'Spray Pump Motor 1 Pump','Spray Pump Motor 1 Pump',
             'Spray Pump Motor 2 Pump','Spray Pump Motor 2 Pump',
             'Spray Pump Motor 3 Pump','Spray Pump Motor 3 Pump']}

df = pd.DataFrame(a)

# copy the values from 'name' to 'final_component' with ffill(),
# which fills in the rows where the name is missing;
# this only works if every missing row belongs with the last named row above it

df['final_component'] = df['name'].ffill()
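
To illustrate the caveat that a plain ffill() ignores code, here is a small sketch on made-up data in which a code group starts with a missing name. The data and the variable names plain, grouped, and both are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first C002 row has no name.
df = pd.DataFrame({
    'code': ['C001', 'C001', 'C002', 'C002'],
    'name': ['Exhaust Blower Drive', np.nan, np.nan,
             'Spray Pump Motor 1 Pump'],
})

# Plain forward fill ignores 'code', so row 2 inherits the C001 name.
plain = df['name'].ffill()

# Group-wise forward fill stops at code boundaries; row 2 stays missing.
grouped = df.groupby('code')['name'].ffill()

# Forward fill followed by backward fill inside each group
# also covers a gap at the start of a group.
both = df.groupby('code')['name'].transform(lambda s: s.ffill().bfill())

print(pd.DataFrame({'plain': plain, 'grouped': grouped, 'both': both}))
```

If every code group begins with a named row, as in the question's table, all three variants give the same result.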

Solution 2: when the fill must depend on the values in another column

If you need to fill based on the values in code, you can build a lookup per code and then update the missing values. Try something like this:

import pandas as pd
import numpy as np
a = {'code':['C001']*7+['C002']*2+['C003']*2+['C004']*2,
     'typ':['ACT','AL','AL','SET','AL','AL','AL','ACT','SET','ACT','SET','ACT','SET'],
     'name':['Exhaust Blower Drive',np.nan,np.nan,'Exhaust Blower Drive',np.nan,np.nan,np.nan,
             'Spray Pump Motor 1 Pump','Spray Pump Motor 1 Pump',
             'Spray Pump Motor 2 Pump','Spray Pump Motor 2 Pump',
             'Spray Pump Motor 3 Pump','Spray Pump Motor 3 Pump']}

df = pd.DataFrame(a)

# copy all the values from 'name' to 'final_component', including nulls
df['final_component'] = df['name']
# build a lookup Series: first non-null final_component for each unique code
lookup = df[['code', 'final_component']].groupby('code').first()['final_component']
# identify all the null values that need to be replaced
noname = df['final_component'].isnull()
# replace the null values via the lookup; assigning through df.loc avoids
# chained assignment, which may silently fail to update df
df.loc[noname, 'final_component'] = df.loc[noname, 'code'].map(lookup)

print(df)

The output looks like this:

    code  typ                     name          final_component
0   C001  ACT     Exhaust Blower Drive     Exhaust Blower Drive
1   C001   AL                      NaN     Exhaust Blower Drive
2   C001   AL                      NaN     Exhaust Blower Drive
3   C001  SET     Exhaust Blower Drive     Exhaust Blower Drive
4   C001   AL                      NaN     Exhaust Blower Drive
5   C001   AL                      NaN     Exhaust Blower Drive
6   C001   AL                      NaN     Exhaust Blower Drive
7   C002  ACT  Spray Pump Motor 1 Pump  Spray Pump Motor 1 Pump
8   C002  SET  Spray Pump Motor 1 Pump  Spray Pump Motor 1 Pump
9   C003  ACT  Spray Pump Motor 2 Pump  Spray Pump Motor 2 Pump
10  C003  SET  Spray Pump Motor 2 Pump  Spray Pump Motor 2 Pump
11  C004  ACT  Spray Pump Motor 3 Pump  Spray Pump Motor 3 Pump
12  C004  SET  Spray Pump Motor 3 Pump  Spray Pump Motor 3 Pump
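
The lookup above takes the first non-null name for each code, whatever its type. If the value has to come specifically from the "SET" row, as the question words it, a variation along these lines might work. It is only a sketch on shortened, made-up data, and it assumes the column is called type (as in the question) and that each code has at most one relevant SET row.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened version of the question's data.
df = pd.DataFrame({
    'code': ['C001'] * 4 + ['C002'] * 2,
    'type': ['ACT', 'AL', 'SET', 'AL', 'ACT', 'SET'],
    'name': ['Exhaust Blower Drive', np.nan, 'Exhaust Blower Drive', np.nan,
             'Spray Pump Motor 1 Pump', 'Spray Pump Motor 1 Pump'],
})

# One name per code, taken from the rows whose type is 'SET'.
# drop_duplicates keeps the first SET row if a code has several (an assumption).
set_names = (df.loc[df['type'] == 'SET']
               .drop_duplicates('code')
               .set_index('code')['name'])

# Map every row's code to its SET name; codes without a SET row become NaN.
df['final_component'] = df['code'].map(set_names)
print(df)
```

Because the mapping is keyed on code, the result does not depend on row order.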
