Finding a regex match over a range in a Python DataFrame

Posted 2024-06-16 11:54:12


I am facing a problem.

I have a DataFrame named employer that looks like this:

employer
------------
wings brand activation i pvt ltd
hofincons infotech &industrial services pvt .ltd
bharat fritz werner bangalore
kludi rak indpvt ltd.

Another DataFrame (called pincode) maps employer names to categories, like this:

Index   Name                                    FINAL_CATEGORY
68781   central board of excise and customs     cat b
68782   c a g hotels pvt ltd                    cat b
68783   avaneetha textiles pvt ltd              cat a
68784   trendy wheels pvt ltd                   cat a+
68785   wings brand activations india pvt ltd   cat b

Now I want to emulate the following sequence of searches:

pincode[pincode['Name'].str.contains('wings brand activation i pvt ltd')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation i pvt')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation i')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation')]

        Name                                    FINAL_CATEGORY
68785   wings brand activations india pvt ltd   cat b

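As you can see, for each string I keep shortening it from the end, cutting back to the last space each time, and then search again.

The above needs to run in a loop (using regex, I think). So for every entry in the employer table, it should search the whole of pincode and find the closest match. If nothing matches at all, it should return NaN.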
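The shortening step described above can be sketched on its own, before any DataFrame is involved. This is only a minimal illustration: `trailing_prefixes` is a hypothetical helper name, not something from pandas, and it assumes words are separated by whitespace.

```python
def trailing_prefixes(name):
    """Yield `name`, then ever shorter versions with the last word dropped."""
    words = name.split()
    # Start with all words, then remove one trailing word per step.
    for length in range(len(words), 0, -1):
        yield " ".join(words[:length])


candidates = list(trailing_prefixes("wings brand activation i pvt ltd"))
for candidate in candidates:
    print(candidate)
```

Each candidate would then be passed to `str.contains` in turn, stopping at the first one that matches anything in pincode.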

Thanks in advance. The question is a bit tricky, so please ask if anything needs clarifying.


2 Answers

You can use an iterative approach like this:

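import numpy as np
import pandas as pd

def find_substr(employer, pincode):
    """Add a 'cat' column: the category of the first pincode name that
    contains the longest leading run of words of each employer name."""
    cats = []
    for name in employer['employer']:
        words = name.split()
        cat = np.nan  # stays NaN when no prefix matches anything
        # Try the full name first, then drop one trailing word at a time.
        for length in range(len(words), 0, -1):
            substr = ' '.join(words[:length])
            # regex=False matches literally, so '(' or '.' need no escaping.
            mask = pincode['Name'].str.contains(substr, regex=False)
            if mask.any():
                cat = pincode.loc[mask, 'FINAL_CATEGORY'].iloc[0]
                break
        cats.append(cat)
    return employer.assign(cat=cats)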

employer = find_substr(employer, pincode)
print(employer)
                                           employer    cat
0                  wings brand activation i pvt ltd  cat b
1  hofincons infotech &industrial services pvt .ltd    NaN
2                     bharat fritz werner bangalore    NaN
3                              kludi rak indpvt ltd    NaN

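Here is one approach.

First, convert your pincode DataFrame into a dictionary that maps each name string to its category. Then use a nested list comprehension to create the cat column of the employer DataFrame, recording every category whose name contains the employer name: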

import pandas as pd

# Example df
employer = pd.DataFrame({"employer": ["wings brand activation i pvt ltd", "bharat fritz werner bangalore"]})
pins = pd.DataFrame({"Name": ["trendy wheels pvt ltd", "wings brand activation i pvt ltd"], "FINAL_CATEGORY": ["cat a+", "cat b"]})

# Map each company name to its category
dict_pins = dict(zip(pins['Name'], pins['FINAL_CATEGORY']))
# For each employer, collect the categories of all names that contain it as a substring
employer['cat'] = [[cat for name, cat in dict_pins.items() if x in name] for x in employer['employer']]

Output:

                           employer      cat
0  wings brand activation i pvt ltd  [cat b]
1     bharat fritz werner bangalore       []
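The question asks for NaN when nothing matches, whereas the list column above holds an empty list in that case. A minimal sketch of one way to collapse it, assuming you only want the first matching category (that choice is my assumption, not something the answer states):

```python
import numpy as np
import pandas as pd

# Same example data as in the answer above
employer = pd.DataFrame({"employer": ["wings brand activation i pvt ltd", "bharat fritz werner bangalore"]})
pins = pd.DataFrame({"Name": ["trendy wheels pvt ltd", "wings brand activation i pvt ltd"], "FINAL_CATEGORY": ["cat a+", "cat b"]})

dict_pins = dict(zip(pins["Name"], pins["FINAL_CATEGORY"]))
employer["cat"] = [[cat for name, cat in dict_pins.items() if x in name] for x in employer["employer"]]

# Keep the first matching category, or NaN when the list is empty.
employer["cat"] = employer["cat"].apply(lambda cats: cats[0] if cats else np.nan)
print(employer)
```

Note that this still only matches when the full employer string appears inside a name; it does not shorten the string word by word the way the question describes.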
