Python: check whether a keyword appears split up inside a string

def matcher(x, word_dict):
    match = ""
    for i in list(dict.fromkeys(word_dict)):
        if i.replace(" ", "").lower() in x.replace(" ", "").lower():
            if match == "":
                match = i
            else:
                match = match + "_" + i
    return match

import pandas as pd

df = pd.DataFrame({'ID' : ['1', '2', '3', '4','5'],
        'Text' : ['sample 123 45 678 text','sample as123456 text','sample As123 456','sample bas123456 text','sample bas123 456ts text']},
        columns = ['ID','Text'])
master_dict = pd.DataFrame({'Keyword' : ['12345678','as123456']},
                  columns = ['Keyword'])

df['Match'] = df['Text'].apply(lambda x: matcher(x, master_dict.Keyword))

Expected Output

  ID                      Text     Match
0  1    sample 123 45 678 text  12345678
1  2      sample as123456 text  as123456
2  3          sample As123 456  as123456
3  4     sample bas123456 text        NA
4  5  sample bas123 456ts text        NA

2 answers

User

Answer 1 · edited 2024-05-26 17:43:51

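If the keyword is only part of a longer string, checking with the in operator still returns True. I think using:

if string == keyword:

after stripping the spaces will give the result you want: anything that is not exactly equal to the keyword evaluates to False.

Let me know whether I understood your requirement correctly and whether this helps.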
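To make the difference concrete, here is a minimal sketch using row 4 of the question's data. The variable names are only illustrative. The answer does not say which piece of the text should be compared, so this only contrasts the two operators on the whole normalized text.

```python
keyword = "as123456"
text = "sample bas123456 text"

# Same normalization as the question's matcher: drop spaces, lower-case.
normalized = text.replace(" ", "").lower()  # "samplebas123456text"

contains = keyword in normalized   # substring test: matches inside "bas123456"
equals = normalized == keyword     # exact comparison: the whole string must be the keyword

print(contains, equals)  # True False
```

Comparing the whole normalized text would also reject "sample as123456 text", so an equality test has to be applied to the candidate word rather than to the full sentence.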

User

Answer 2 · edited 2024-05-26 17:43:51

You can use a Pandas adaptation of my previous solution:

import pandas as pd
import numpy as np
import re

df = pd.DataFrame({'ID' : ['1', '2', '3', '4','5'], 
        'Text' : ['sample 123 45 678 text','sample as123456 text','sample As123 456','sample bas123456 text','sample bas123 456ts text']}, 
        columns = ['ID','Text'])
master_dict= pd.DataFrame({'Keyword' : ['12345678','as123456']}, 
                  columns = ['Keyword'])

words = master_dict['Keyword'].to_list()
words_dict = { f'g{i}':item for i,item in enumerate(words) } 
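# Raw strings avoid the invalid "\W" escape warning; re.escape guards against regex metacharacters in keywords.
rx = re.compile(r"(?i)\b(?:" + '|'.join(r'(?P<g{}>{})'.format(i, r"[\W_]*".join(re.escape(c) for c in item)) for i, item in enumerate(words)) + r")\b")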
print(rx.pattern)

def findvalues(x):
    m = rx.search(x)
    if m:
        return [words_dict.get(key) for key,value in m.groupdict().items() if value][0]
    else:
        return np.nan

df['Match'] = df['Text'].apply(lambda x: findvalues(x))

The pattern is

(?i)\b(?:(?P<g0>1[\W_]*2[\W_]*3[\W_]*4[\W_]*5[\W_]*6[\W_]*7[\W_]*8)|(?P<g1>a[\W_]*s[\W_]*1[\W_]*2[\W_]*3[\W_]*4[\W_]*5[\W_]*6))\b

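Basically, it is a \b(?:keyword1|keyword2|...|keywordN)\b regex with [\W_]* (which matches zero or more non-alphanumeric characters) between every pair of adjacent characters. Because \b is a word boundary, the keywords only match as whole words. This works for your keywords because you confirmed that they are numeric or alphanumeric.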
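The word-boundary behaviour can be checked without pandas. The sketch below builds the same kind of pattern for a single keyword and runs it against four of the question's sample texts; the names single_rx and results are illustrative only.

```python
import re

keyword = "as123456"

# Put [\W_]* between every pair of adjacent characters and wrap the result in \b ... \b.
single_rx = re.compile(r"(?i)\b" + r"[\W_]*".join(keyword) + r"\b")

texts = [
    "sample as123456 text",      # keyword as a whole word
    "sample As123 456",          # split by a space, different case
    "sample bas123456 text",     # preceded by a letter: no boundary before "a"
    "sample bas123 456ts text",  # preceded and followed by letters
]

results = [bool(single_rx.search(t)) for t in texts]
print(results)  # [True, True, False, False]
```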

Demo output:

>>> df
  ID                      Text     Match
0  1    sample 123 45 678 text  12345678
1  2      sample as123456 text  as123456
2  3          sample As123 456  as123456
3  4     sample bas123456 text       NaN
4  5  sample bas123 456ts text       NaN
>>>
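The question's matcher joined several hits with "_", whereas findvalues returns only the first keyword that matched. If the joined form is wanted, one option is to test each keyword with its own pattern. This is an assumption about the desired behaviour and not part of the answer; keyword_pattern and match_all are hypothetical helper names.

```python
import re

def keyword_pattern(keyword):
    """Compile a whole-word pattern that tolerates separators inside the keyword."""
    body = r"[\W_]*".join(re.escape(c) for c in keyword)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)

def match_all(text, keywords):
    """Return every matching keyword, de-duplicated and joined with '_'."""
    hits = [k for k in dict.fromkeys(keywords) if keyword_pattern(k).search(text)]
    return "_".join(hits)

keywords = ["12345678", "as123456"]
print(match_all("sample 123 45 678 text and As123 456", keywords))  # 12345678_as123456
```

With pandas this could be applied as df['Text'].apply(lambda x: match_all(x, master_dict.Keyword)); it returns an empty string when nothing matches, as the original matcher did, and not NaN.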
