Populate a DataFrame column from another DataFrame based on partial string matching

Posted 2024-05-23 16:14:46


I'm new to Python programming. I have two DataFrames: df1 contains tags (180k rows) and df2 contains equipment names (1,600 rows).

df1:

          Line                TagName                CLASS 
187877    PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL     10
187878    PT_WOA  .ZS01_RB2202_T05.SB.S2385_FLOK      10
187879    PT_WOA  .ZS01_LA120_T05.SB._CBAbsHy         10
187880    PT_WOA  .ZS01_LA120_T05.SB.S3110_CBAPV      10
187881    PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy     10

df2:

EquipmentNo EquipmentDescription    Equipment
1311256        Lifting table         LA120
1311257        Roller bed            RB2200
1311258        Lifting table         LT2202
1311259        Roller bed            RB2202
1311260        Roller bed            RB2204

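Each df2.Equipment value appears somewhere inside a df1.TagName string. I need to match rows based on whether a df2 Equipment value occurs in a df1 TagName, and then attach the corresponding df2 fields (EquipmentDescription and EquipmentNo) to the matching df1 row.

The final output should be: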

        Line                TagName                EquipmentDescription  EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL     Lifting table        1311256
187878  PT_WOA  .ZS01_RB2202_T05.SB.S2385_FLOK      Roller bed           1311259
187879  PT_WOA  .ZS01_LA120_T05.SB._CBAbsHy         Lifting table        1311256
187880  PT_WOA  .ZS01_LA120_T05.SB.S3110_CBAPV      Lifting table        1311256
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy     Roller bed           1311260

Here is what I have tried so far:

cols= df2['Equipment'].tolist()
Xs=[]
for i in cols:
    Test = df1.loc[df1.TagName.str.contains(i)] 
    Test['Equip']=i
    Xs.append(Test)

Then I merge Xs with df2 on the equipment column.

But I get this error:

first argument must be string or compiled pattern
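
A possible cause, stated as an assumption since the full data is not shown: `str.contains` treats its pattern as a regular expression by default, and this message is what Python's `re` module reports when the pattern is not a string. That would happen if some df2.Equipment values are missing (NaN) or numeric. The sketch below uses made-up data with one missing and one numeric entry, and cleans the patterns before looping.

```python
import pandas as pd

# Hypothetical data: one Equipment entry is missing and one is a number
df2 = pd.DataFrame({"Equipment": ["LA120", None, 2202]})
df1 = pd.DataFrame({"TagName": [".ZS01_LA120_T05.SB.S2384_LesSwL",
                                ".ZS01_RB2202_T05.SB.S2385_FLOK"]})

# Drop missing values and cast the rest to str so every pattern is a string
patterns = df2["Equipment"].dropna().astype(str).tolist()

frames = []
for pat in patterns:
    # regex=False treats the code as a literal substring, not a regex;
    # .copy() avoids assigning into a slice of df1
    hit = df1.loc[df1["TagName"].str.contains(pat, regex=False)].copy()
    hit["Equip"] = pat
    frames.append(hit)

result = pd.concat(frames)
print(result)
```

Whether this matches the real data depends on what df2.Equipment actually holds; checking `df2['Equipment'].map(type).value_counts()` is one way to find out.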


3 answers

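I would do it like this:

  1. Create a new column, indexes, that holds, for each Equipment in df2, the list of df1 index labels whose TagName contains that Equipment

  2. Flatten df2 so that each matching index gets its own row, using stack() and reset_index()

  3. Join the flattened df2 with df1 to get all the required information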
from io import StringIO
import numpy as np
import pandas as pd
df1=StringIO("""Line;TagName;CLASS
187877;PT_WOA;.ZS01_LA120_T05.SB.S2384_LesSwL;10
187878;PT_WOA;.ZS01_RB2202_T05.SB.S2385_FLOK;10
187879;PT_WOA;.ZS01_LA120_T05.SB._CBAbsHy;10
187880;PT_WOA;.ZS01_LA120_T05.SB.S3110_CBAPV;10
187881;PT_WOA;.ZS01_LARB2204.SB.S3111_CBRelHy;10""")
df2=StringIO("""EquipmentNo;EquipmentDescription;Equipment
1311256;Lifting table;LA120
1311257;Roller bed;RB2200
1311258;Lifting table;LT2202
1311259;Roller bed;RB2202
1311260;Roller bed;RB2204""")
df1=pd.read_csv(df1,sep=";")
df2=pd.read_csv(df2,sep=";")

df2['indexes'] = df2['Equipment'].apply(lambda x: df1.index[df1.TagName.str.contains(str(x)).tolist()].tolist())
indexes = df2.apply(lambda x: pd.Series(x['indexes']),axis=1).stack().reset_index(level=1, drop=True)
indexes.name = 'indexes'
df2 = df2.drop('indexes', axis=1).join(indexes).dropna()
df2.index = df2['indexes']
matches = df2.join(df1, how='inner')
print(matches[['Line','TagName','EquipmentDescription','EquipmentNo']])

Output:

          Line                          TagName EquipmentDescription  EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL        Lifting table      1311256
187879  PT_WOA      .ZS01_LA120_T05.SB._CBAbsHy        Lifting table      1311256
187880  PT_WOA   .ZS01_LA120_T05.SB.S3110_CBAPV        Lifting table      1311256
187878  PT_WOA   .ZS01_RB2202_T05.SB.S2385_FLOK           Roller bed      1311259
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy           Roller bed      1311260
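
As an alternative to building index lists, the same result can probably be reached in one vectorized pass. The sketch below is not the method used above: it swaps in `str.extract` with a regex alternation built from the equipment codes, followed by a left merge. It assumes each tag contains at most one code of interest; the names `codes`, `pattern` and `out` are only for illustration.

```python
import re
import pandas as pd

df1 = pd.DataFrame(
    {"Line": ["PT_WOA"] * 5,
     "TagName": [".ZS01_LA120_T05.SB.S2384_LesSwL",
                 ".ZS01_RB2202_T05.SB.S2385_FLOK",
                 ".ZS01_LA120_T05.SB._CBAbsHy",
                 ".ZS01_LA120_T05.SB.S3110_CBAPV",
                 ".ZS01_LARB2204.SB.S3111_CBRelHy"],
     "CLASS": [10] * 5},
    index=[187877, 187878, 187879, 187880, 187881])

df2 = pd.DataFrame(
    {"EquipmentNo": [1311256, 1311257, 1311258, 1311259, 1311260],
     "EquipmentDescription": ["Lifting table", "Roller bed", "Lifting table",
                              "Roller bed", "Roller bed"],
     "Equipment": ["LA120", "RB2200", "LT2202", "RB2202", "RB2204"]})

# Longest codes first, so a longer code wins over a shorter one that is
# a prefix of it at the same position; re.escape keeps the codes literal
codes = sorted(df2["Equipment"].astype(str).unique(), key=len, reverse=True)
pattern = "(" + "|".join(re.escape(c) for c in codes) + ")"

out = df1.drop(columns="CLASS")
# First code found in each tag, or NaN when no code occurs
out["Equipment"] = out["TagName"].str.extract(pattern, expand=False)

# reset_index/set_index carries the original df1 index through the merge
out = (out.reset_index()
          .merge(df2, on="Equipment", how="left")
          .set_index("index"))

print(out[["Line", "TagName", "EquipmentDescription", "EquipmentNo"]])
```

With 180k tags and 1,600 codes, a single regex scan per tag may be faster than one `str.contains` pass per code, though that is worth measuring on the real data.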

Initialize the provided DataFrames:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([['PT_WOA', '.ZS01_LA120_T05.SB.S2384_LesSwL', 10],
                    ['PT_WOA', '.ZS01_RB2202_T05.SB.S2385_FLOK', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB._CBAbsHy', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB.S3110_CBAPV', 10],
                    ['PT_WOA', '.ZS01_LARB2204.SB.S3111_CBRelHy', 10]],
                   columns = ['Line', 'TagName', 'CLASS'],
                   index = [187877, 187878, 187879, 187880, 187881])

df2 = pd.DataFrame([[1311256, 'Lifting table', 'LA120'],
                    [1311257, 'Roller bed', 'RB2200'],
                    [1311258, 'Lifting table', 'LT2202'],
                    [1311259, 'Roller bed', 'RB2202'],
                    [1311260, 'Roller bed', 'RB2204']],
                  columns = ['EquipmentNo', 'EquipmentDescription', 'Equipment'])

I suggest the following:

# create a copy of df1, dropping the 'CLASS' column
df3 = df1.drop(columns=['CLASS'])

# add the columns 'EquipmentDescription' and 'EquipmentNo' filled with numpy NaN's
df3['EquipmentDescription'] = np.nan
df3['EquipmentNo'] = np.nan

# for each row in df3, iterate over each row in df2
for index_df3, row_df3 in df3.iterrows():
    for index_df2, row_df2 in df2.iterrows():

        # check if 'Equipment' is in 'TagName'
        if df2.loc[index_df2, 'Equipment'] in df3.loc[index_df3, 'TagName']:

            # set 'EquipmentDescription' and 'EquipmentNo'
            df3.loc[index_df3, 'EquipmentDescription'] = df2.loc[index_df2, 'EquipmentDescription']
            df3.loc[index_df3, 'EquipmentNo'] = df2.loc[index_df2, 'EquipmentNo']


# convert 'EquipmentNo' to type int (this raises if any row was left unmatched as NaN)
df3['EquipmentNo'] = df3['EquipmentNo'].astype(int)

This produces the following DataFrame:

        Line    TagName                         EquipmentDescription EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table        1311256
187878  PT_WOA  .ZS01_RB2202_T05.SB.S2385_FLOK  Roller bed           1311259
187879  PT_WOA  .ZS01_LA120_T05.SB._CBAbsHy     Lifting table        1311256
187880  PT_WOA  .ZS01_LA120_T05.SB.S3110_CBAPV  Lifting table        1311256
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed           1311260

Let me know if this helps.
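
One caveat with the loop above is that `astype(int)` fails as soon as a tag matches no equipment. A minimal sketch of a variant that tolerates unmatched rows follows. It assumes that taking the first matching code is acceptable (the nested loop above keeps the last one), uses the nullable `Int64` dtype for the number column, and includes a made-up tag that matches nothing. The helper `find_equipment` is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"TagName": [".ZS01_LA120_T05.SB.S2384_LesSwL",
                                ".ZS01_XX999_T05.SB.S0000_Demo"]})  # second tag is made up
df2 = pd.DataFrame({"EquipmentNo": [1311256],
                    "EquipmentDescription": ["Lifting table"],
                    "Equipment": ["LA120"]})

# Plain tuples are cheaper to loop over than repeated df2.loc lookups
lookup = list(df2[["Equipment", "EquipmentDescription", "EquipmentNo"]]
              .itertuples(index=False, name=None))

def find_equipment(tag):
    """Return (description, number) of the first code found in tag, else (None, None)."""
    for code, desc, number in lookup:
        if code in tag:
            return desc, number
    return None, None

matched = [find_equipment(tag) for tag in df1["TagName"]]

df3 = df1.copy()
df3["EquipmentDescription"] = [desc for desc, _ in matched]
# Nullable integer dtype keeps matched numbers as integers and unmatched ones as <NA>
df3["EquipmentNo"] = pd.array([number for _, number in matched], dtype="Int64")

print(df3)
```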

  • Given df1 and df2 as follows:

df1

|    | Line   | TagName                         |   CLASS |
|---:|:-------|:--------------------------------|--------:|
|  0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL |      10 |
|  1 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK  |      10 |
|  2 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy     |      10 |
|  3 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV  |      10 |
|  4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy |      10 |

df2

|    |   EquipmentNo | EquipmentDescription   | Equipment   |
|---:|--------------:|:-----------------------|:------------|
|  0 |       1311256 | Lifting table          | LA120       |
|  1 |       1311257 | Roller bed             | RB2200      |
|  2 |       1311258 | Lifting table          | LT2202      |
|  3 |       1311259 | Roller bed             | RB2202      |
|  4 |       1311260 | Roller bed             | RB2204      |
  1. Find the unique equipment values in the Equipment column of df2
equipment = df2.Equipment.unique().tolist()
  2. Create an Equipment column in df1 by looking for matches from equipment in each TagName
df1['Equipment'] = df1['TagName'].apply(lambda x: ''.join([part for part in equipment if part in x]))
  3. Merge on Equipment to get the final form
    • If you don't want the Equipment column in df_final, add .drop(columns=['Equipment']) to the end of the next line of code
df_final = df1[['Line', 'TagName', 'Equipment']].merge(df2, on='Equipment')

df_final

|    | Line   | TagName                         | Equipment   |   EquipmentNo | EquipmentDescription   |
|---:|:-------|:--------------------------------|:------------|--------------:|:-----------------------|
|  0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL | LA120       |       1311256 | Lifting table          |
|  1 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy     | LA120       |       1311256 | Lifting table          |
|  2 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV  | LA120       |       1311256 | Lifting table          |
|  3 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK  | RB2202      |       1311259 | Roller bed             |
|  4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy | RB2204      |       1311260 | Roller bed             |
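
Two edge cases are worth keeping in mind with this approach. `''.join(...)` concatenates every code found in a tag, so a tag that contains two codes gets a key that exists nowhere in df2, and the default inner merge then drops that row, along with any row that has no match. The sketch below is one way to handle both, under the assumption that the longest matching code is the intended one. The data and the helper `longest_match` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'LA12' is a prefix of 'LA120', so the first tag contains both codes;
# the second tag contains neither
df1 = pd.DataFrame({"TagName": [".ZS01_LA120_T05.SB.S2384_LesSwL",
                                ".ZS01_ZZ999_T05.SB.S0001_Demo"]})
df2 = pd.DataFrame({"EquipmentNo": [1, 2],
                    "Equipment": ["LA12", "LA120"]})

equipment = df2["Equipment"].unique().tolist()

def longest_match(tag):
    """Return the longest equipment code contained in tag, or None if there is none."""
    hits = [code for code in equipment if code in tag]
    return max(hits, key=len) if hits else None

df1["Equipment"] = df1["TagName"].apply(longest_match)

# how='left' keeps every df1 row; unmatched rows get NaN in the df2 columns
df_final = df1.merge(df2, on="Equipment", how="left")
print(df_final)
```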
