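Populate a DataFrame column from another DataFrame based on a partial string match

3 Answers

User

Answer 1 · edited 2024-05-23 16:14:46

Here is how I would do it:

- Create a new column indexes that holds, for each Equipment in df2, the list of df1 index labels whose TagName contains that Equipment value
- Flatten df2 to one row per matching label using stack() and reset_index()
- Join the flattened df2 with df1 to get all the required information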

from io import StringIO
import pandas as pd
df1=StringIO("""Line;TagName;CLASS
187877;PT_WOA;.ZS01_LA120_T05.SB.S2384_LesSwL;10
187878;PT_WOA;.ZS01_RB2202_T05.SB.S2385_FLOK;10
187879;PT_WOA;.ZS01_LA120_T05.SB._CBAbsHy;10
187880;PT_WOA;.ZS01_LA120_T05.SB.S3110_CBAPV;10
187881;PT_WOA;.ZS01_LARB2204.SB.S3111_CBRelHy;10""")
df2=StringIO("""EquipmentNo;EquipmentDescription;Equipment
1311256;Lifting table;LA120
1311257;Roller bed;RB2200
1311258;Lifting table;LT2202
1311259;Roller bed;RB2202
1311260;Roller bed;RB2204""")
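# The header names three columns but every data row has four fields,
# so read_csv uses the first field (187877, ...) as the index.
df1=pd.read_csv(df1,sep=";")
df2=pd.read_csv(df2,sep=";")

# For each Equipment value, list the df1 index labels whose TagName contains it
# (regex=False treats the value as a literal substring)
df2['indexes'] = df2['Equipment'].apply(lambda x: df1.index[df1.TagName.str.contains(str(x), regex=False).tolist()].tolist())
# Spread each list across columns, then stack into one row per matching label
indexes = df2.apply(lambda x: pd.Series(x['indexes']),axis=1).stack().reset_index(level=1, drop=True)
indexes.name = 'indexes'
# Attach the labels to df2 and drop equipment that matched nothing
df2 = df2.drop('indexes', axis=1).join(indexes).dropna()
df2.index = df2['indexes']
# Inner join on the index pulls in Line and TagName from df1
matches = df2.join(df1, how='inner')
print(matches[['Line','TagName','EquipmentDescription','EquipmentNo']])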

Output:

          Line                          TagName EquipmentDescription  EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL        Lifting table      1311256
187879  PT_WOA      .ZS01_LA120_T05.SB._CBAbsHy        Lifting table      1311256
187880  PT_WOA   .ZS01_LA120_T05.SB.S3110_CBAPV        Lifting table      1311256
187878  PT_WOA   .ZS01_RB2202_T05.SB.S2385_FLOK           Roller bed      1311259
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy           Roller bed      1311260
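
As a possible alternative to the stack()/reset_index() flattening, pandas 0.25 and later provide DataFrame.explode, which turns a column of lists into one row per element. The following is only a sketch under that assumption: it rebuilds the sample frames inline, and the names flat and matches are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Line": ["PT_WOA"] * 5,
        "TagName": [
            ".ZS01_LA120_T05.SB.S2384_LesSwL",
            ".ZS01_RB2202_T05.SB.S2385_FLOK",
            ".ZS01_LA120_T05.SB._CBAbsHy",
            ".ZS01_LA120_T05.SB.S3110_CBAPV",
            ".ZS01_LARB2204.SB.S3111_CBRelHy",
        ],
        "CLASS": [10] * 5,
    },
    index=[187877, 187878, 187879, 187880, 187881],
)
df2 = pd.DataFrame(
    {
        "EquipmentNo": [1311256, 1311257, 1311258, 1311259, 1311260],
        "EquipmentDescription": ["Lifting table", "Roller bed", "Lifting table",
                                 "Roller bed", "Roller bed"],
        "Equipment": ["LA120", "RB2200", "LT2202", "RB2202", "RB2204"],
    }
)

# For each Equipment value, collect the df1 index labels whose TagName contains it
df2["indexes"] = df2["Equipment"].apply(
    lambda eq: df1.index[df1["TagName"].str.contains(eq, regex=False)].tolist()
)

# explode yields one row per label; empty lists become NaN and are dropped here
flat = df2.explode("indexes").dropna(subset=["indexes"])
flat["indexes"] = flat["indexes"].astype("int64")

# Join on the df1 index labels, then restore df1's row order
matches = flat.set_index("indexes").join(df1, how="inner").sort_index()
print(matches[["Line", "TagName", "EquipmentDescription", "EquipmentNo"]])
```

Equipment with no matching tag (RB2200 and LT2202 in the sample) disappears at the dropna step, which mirrors the inner join in the answer.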

User

Answer 2 · edited 2024-05-23 16:14:46

Initialize the provided DataFrames:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([['PT_WOA', '.ZS01_LA120_T05.SB.S2384_LesSwL', 10],
                    ['PT_WOA', '.ZS01_RB2202_T05.SB.S2385_FLOK', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB._CBAbsHy', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB.S3110_CBAPV', 10],
                    ['PT_WOA', '.ZS01_LARB2204.SB.S3111_CBRelHy', 10]],
                   columns = ['Line', 'TagName', 'CLASS'],
                   index = [187877, 187878, 187879, 187880, 187881])

df2 = pd.DataFrame([[1311256, 'Lifting table', 'LA120'],
                    [1311257, 'Roller bed', 'RB2200'],
                    [1311258, 'Lifting table', 'LT2202'],
                    [1311259, 'Roller bed', 'RB2202'],
                    [1311260, 'Roller bed', 'RB2204']],
                  columns = ['EquipmentNo', 'EquipmentDescription', 'Equipment'])

I suggest the following:

# create a copy of df1, dropping the 'CLASS' column
df3 = df1.drop(columns=['CLASS'])

# add the columns 'EquipmentDescription' and 'EquipmentNo', initially empty
# (None gives an object column that can hold strings; NaN gives a float column)
df3['EquipmentDescription'] = None
df3['EquipmentNo'] = np.nan

# for each row in df3, iterate over each row in df2
for index_df3, row_df3 in df3.iterrows():
    for index_df2, row_df2 in df2.iterrows():

        # check if 'Equipment' is in 'TagName'
        if row_df2['Equipment'] in row_df3['TagName']:

            # set 'EquipmentDescription' and 'EquipmentNo'
            df3.loc[index_df3, 'EquipmentDescription'] = row_df2['EquipmentDescription']
            df3.loc[index_df3, 'EquipmentNo'] = row_df2['EquipmentNo']

# convert 'EquipmentNo' to a nullable integer type, so rows without a match stay <NA>
df3['EquipmentNo'] = df3['EquipmentNo'].astype('Int64')

This produces the following DataFrame:

        Line    TagName                         EquipmentDescription EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table        1311256
187878  PT_WOA  .ZS01_RB2202_T05.SB.S2385_FLOK  Roller bed           1311259
187879  PT_WOA  .ZS01_LA120_T05.SB._CBAbsHy     Lifting table        1311256
187880  PT_WOA  .ZS01_LA120_T05.SB.S3110_CBAPV  Lifting table        1311256
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed           1311260

Let me know if this helps.
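
The nested iterrows() loops compare every tag with every equipment code. One way to express the same all-pairs comparison without explicit loops over the frames is a cross join followed by a filter. This is a sketch that assumes pandas 1.2 or later (for how="cross"); the names pairs, mask and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Line": ["PT_WOA"] * 5,
        "TagName": [
            ".ZS01_LA120_T05.SB.S2384_LesSwL",
            ".ZS01_RB2202_T05.SB.S2385_FLOK",
            ".ZS01_LA120_T05.SB._CBAbsHy",
            ".ZS01_LA120_T05.SB.S3110_CBAPV",
            ".ZS01_LARB2204.SB.S3111_CBRelHy",
        ],
    },
    index=[187877, 187878, 187879, 187880, 187881],
)
df2 = pd.DataFrame(
    {
        "EquipmentNo": [1311256, 1311257, 1311258, 1311259, 1311260],
        "EquipmentDescription": ["Lifting table", "Roller bed", "Lifting table",
                                 "Roller bed", "Roller bed"],
        "Equipment": ["LA120", "RB2200", "LT2202", "RB2202", "RB2204"],
    }
)

# Pair every df1 row with every df2 row; reset_index keeps the labels as a column
pairs = df1.reset_index().merge(df2, how="cross")

# Keep only the pairs where the Equipment code occurs inside the TagName
mask = [eq in tag for eq, tag in zip(pairs["Equipment"], pairs["TagName"])]
result = pairs[mask].set_index("index").drop(columns="Equipment")
print(result)
```

The behaviour differs from the loop in two ways: a tag that contains several codes produces several rows instead of keeping the last match, and tags with no match are dropped rather than left empty. The cross join also builds len(df1) * len(df2) intermediate rows, so it suits small lookup tables best.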

User

Answer 3 · edited 2024-05-23 16:14:46

Given df1 and df2 as follows:

`df1`

|    | Line   | TagName                         |   CLASS |
|---:|:-------|:--------------------------------|--------:|
|  0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL |      10 |
|  1 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK  |      10 |
|  2 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy     |      10 |
|  3 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV  |      10 |
|  4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy |      10 |

`df2`

|    |   EquipmentNo | EquipmentDescription   | Equipment   |
|---:|--------------:|:-----------------------|:------------|
|  0 |       1311256 | Lifting table          | LA120       |
|  1 |       1311257 | Roller bed             | RB2200      |
|  2 |       1311258 | Lifting table          | LT2202      |
|  3 |       1311259 | Roller bed             | RB2202      |
|  4 |       1311260 | Roller bed             | RB2204      |

Find the unique values in the Equipment column of df2

equipment = df2.Equipment.unique().tolist()

Create an Equipment column in df1 by looking for each code from equipment in TagName

df1['Equipment'] = df1['TagName'].apply(lambda x: ''.join([part for part in equipment if part in x]))
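
One caveat with the ''.join(...) expression: if a TagName contains more than one equipment code, the codes are concatenated into a key that matches nothing in df2, and a tag with no code yields an empty string. A minimal sketch of a variant that keeps a single code per tag, using Series.str.extract with an escaped alternation pattern. The last two tags are hypothetical and added only to show these edge cases; the names tags, joined and extracted are illustrative.

```python
import re

import pandas as pd

equipment = ["LA120", "RB2200", "LT2202", "RB2202", "RB2204"]
tags = pd.Series(
    [
        ".ZS01_LA120_T05.SB.S2384_LesSwL",
        ".ZS01_RB2202_T05.SB.S2385_FLOK",
        ".ZS01_LA120_RB2202_T05",  # hypothetical: contains two codes
        ".ZS01_XX999_T05",         # hypothetical: contains no known code
    ]
)

# Behaviour of the join-based lookup, for comparison
joined = tags.apply(lambda x: "".join(code for code in equipment if code in x))

# One capture group that matches any code literally (re.escape guards metacharacters)
pattern = "(" + "|".join(re.escape(code) for code in equipment) + ")"

# expand=False returns a Series: the first code found in each tag, or NaN if none
extracted = tags.str.extract(pattern, expand=False)

print(joined.tolist())
print(extracted.tolist())
```

With extract, unmatched tags come back as NaN, which is easy to detect before or after the merge.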

Merge on Equipment to get the final result
- If you do not want the Equipment column in df_final, append .drop(columns=['Equipment']) to the end of the next line of code

df_final = df1[['Line', 'TagName', 'Equipment']].merge(df2, on='Equipment')

`df_final`

|    | Line   | TagName                         | Equipment   |   EquipmentNo | EquipmentDescription   |
|---:|:-------|:--------------------------------|:------------|--------------:|:-----------------------|
|  0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL | LA120       |       1311256 | Lifting table          |
|  1 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy     | LA120       |       1311256 | Lifting table          |
|  2 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV  | LA120       |       1311256 | Lifting table          |
|  3 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK  | RB2202      |       1311259 | Roller bed             |
|  4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy | RB2204      |       1311260 | Roller bed             |
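
Note that merge defaults to an inner join, so any df1 row whose Equipment value has no counterpart in df2 is silently dropped from df_final. If those rows should be kept and flagged, a left merge with indicator=True is one option. This is a sketch only: the second tag is hypothetical, and the names tagged, merged and unmatched are illustrative.

```python
import pandas as pd

tagged = pd.DataFrame(
    {
        "TagName": [
            ".ZS01_LA120_T05.SB.S2384_LesSwL",
            ".ZS01_XX999_T05.SB.S0000_Demo",  # hypothetical tag with no known code
        ],
        "Equipment": ["LA120", ""],  # '' is what the join-based lookup gives for no match
    }
)
df2 = pd.DataFrame(
    {
        "EquipmentNo": [1311256, 1311259],
        "EquipmentDescription": ["Lifting table", "Roller bed"],
        "Equipment": ["LA120", "RB2202"],
    }
)

# how="left" keeps every tagged row; indicator=True adds a _merge column
merged = tagged.merge(df2, on="Equipment", how="left", indicator=True)

# Rows that found no partner in df2
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(unmatched["TagName"].tolist())
```

Because unmatched rows carry NaN, EquipmentNo becomes a float column after the left merge; convert it with astype('Int64') if integer display matters.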
