Python Pandas Regex: search a column for strings with wildcards and return the matches

Posted 2024-06-16 09:27:04


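I have a list of search terms in one column, which may contain a key such as 'keyword1*keyword2', and I am trying to find matches in a column of a separate DataFrame. How can I include a regex-style wildcard, e.g. 'keyword1.*keyword2', using str.extract, extractall or findall?

Using .str.extract works well for matching exact substrings, but I need it to match substrings with a wildcard between the keywords.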

import re
import pandas as pd

# DataFrame column (or Series/list) holding the keys to search for:
dfKeys = pd.DataFrame()
dfKeys['SearchFor'] = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One']

# Column next to the SearchFor column:
dfKeys['AdjacentCol'] = ['this other string', 'SomeString Else', 'Second String Player', 'Keyword1 Keyword2', 'More String Stuff', 'One More String Example']

# DataFrame column to search in:
df1 = pd.DataFrame()
df1['Description'] = ['Something Here', 'Second Item 7', 'Something There', 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END', 'Second Item 7', 'Even More Stuff']

# I've tried:
df1['Matched'] = df1['Description'].str.extract('(%s)' % '|'.join(dfKeys['SearchFor']), flags=re.IGNORECASE, expand=False)

I have also tried replacing extract with extractall and findall in the code above, but I still can't get the result I need. I expect 'Keyword1*Keyword2' to match "strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END".

Update: '.*' worked! I am also trying to add the value from the cell next to the matched key in the 'SearchFor' column, i.e. dfKeys['AdjacentCol'].

I tried:

df1['From_AdjacentCol'] = df1['Description'].str.extract('(%s)' % '|'.join(dfKeys['SearchFor']), flags=re.IGNORECASE, expand=False).map(dfKeys.set_index('SearchFor')['AdjacentCol'].to_dict()).fillna('')

It works for everything except the keys with wildcards.


Any help would be appreciated. Thank you!


1 answer
User
Answer 1 · Posted 2024-06-16 09:27:04

Solution

You are close to the solution: just change * to .*. From the docs:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

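In a regular expression, a star * on its own has no meaning. It does not mean the same thing as the glob operator * commonly used in Unix/Windows file systems.

The star is a quantifier (a greedy quantifier, to be precise), and it must be attached to some pattern (here ., which matches any character) in order to mean something.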
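To make the difference concrete, here is a minimal sketch using only the standard re module on the description string from the question. The helper glob_key_to_regex is hypothetical (it is not part of re or pandas); it only illustrates one way to turn a glob-style key such as 'Keyword1*Keyword2' into a regex, assuming * is the only wildcard the keys use.

```python
import re

text = 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END'

# 'Keyword1*Keyword2' means: 'Keyword', then zero or more '1', then 'Keyword2'.
# The star quantifies only the preceding '1', so nothing in the text matches.
glob_like = re.search('Keyword1*Keyword2', text, flags=re.IGNORECASE)

# 'Keyword1.*Keyword2' means: 'Keyword1', then any characters, then 'Keyword2'.
wildcard = re.search('Keyword1.*Keyword2', text, flags=re.IGNORECASE)


def glob_key_to_regex(key):
    """Hypothetical helper: escape the key, then turn each glob '*' into '.*'."""
    return re.escape(key).replace(r'\*', '.*')


converted = glob_key_to_regex('Keyword1*Keyword2')

print(glob_like)          # no match object
print(wildcard.group(0))  # the span from KEYWORD1 through KEYWORD2
print(converted)
```

Escaping first keeps any other regex metacharacters in a key literal, which may or may not be what you want if some keys are already written as regular expressions.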

MCVE

Reworking your MCVE:

import re
import pandas as pd

keys = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One' ]

df1 = pd.DataFrame()
df1['Description'] = ['Something Here','Second Item 7', 'Something There',
                      'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END',
                      'Second Item 7', 'Even More Stuff']


regstr = '(%s)' % '|'.join(keys)

df1['Matched'] = df1['Description'].str.extract(regstr, flags=re.IGNORECASE, expand=False)

The regexp is now:

(this|Something|Second|Keyword1.*Keyword2|Stuff|One)

And it matches the missing case:

                                         Description                                Matched
0                                     Something Here                              Something
1                                      Second Item 7                                 Second
2                                    Something There                              Something
3  strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 an...  KEYWORD1 moreJARGON 06/0 010 KEYWORD2
4                                      Second Item 7                                 Second
5                                    Even More Stuff                                  Stuff
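As for the follow-up in the question: the .map(...) lookup fails for wildcard keys because the extracted text ('KEYWORD1 moreJARGON 06/0 010 KEYWORD2') is not itself a key in the dictionary built from dfKeys['SearchFor']. One possible approach, sketched below, is to give every key its own capture group so that the position of the matching group identifies the key. This is an illustration rather than the only way to do it, and it assumes that no key contains capture groups of its own; the helper name adjacent_for is made up for this example.

```python
import re
import pandas as pd

dfKeys = pd.DataFrame({
    'SearchFor': ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One'],
    'AdjacentCol': ['this other string', 'SomeString Else', 'Second String Player',
                    'Keyword1 Keyword2', 'More String Stuff', 'One More String Example'],
})

df1 = pd.DataFrame({
    'Description': ['Something Here', 'Second Item 7', 'Something There',
                    'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END',
                    'Second Item 7', 'Even More Stuff'],
})

# One capture group per key: '(this)|(Something)|...'.
# Assumption: keys contain no capture groups of their own.
pattern = '|'.join('(%s)' % key for key in dfKeys['SearchFor'])

# With several groups, extract returns one column per group (labelled 0, 1, ...);
# at most one of them is non-null in each row.
groups = df1['Description'].str.extract(pattern, flags=re.IGNORECASE, expand=True)


def adjacent_for(row):
    """Return the AdjacentCol value of the key whose group matched, or ''."""
    hits = row.dropna()
    if hits.empty:
        return ''
    # The column label of the non-null group is the position of the key.
    return dfKeys['AdjacentCol'].iloc[hits.index[0]]


df1['From_AdjacentCol'] = groups.apply(adjacent_for, axis=1)
print(df1)
```

Because the lookup goes through the group position instead of the matched text, it does not depend on the case or the exact content of what was extracted.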
