Find a string pattern in a Pandas DataFrame and return the matching string

5 votes

1 answer

10148 views

Asked 2025-04-18 00:16

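I have a DataFrame column that contains comma-separated text, and I want to extract from it the values that appear in another list. My DataFrame looks like this: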

col1 | col2
-----------
 x   | a,b


import re

listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)

def test_for_pattern(x):
    if re.search(pattern, x):
        return pattern
    else:
        return x

# col2.str.contains(pattern) can also be used for the same results

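The filtering above works well, but when a match is found it returns the whole pattern, e.g. a|b, rather than the b that I want. I would like to create another column that contains only the pattern that was found, e.g. b.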
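One way to return only the matched text is to keep the match object from re.search and call .group() on it. The sketch below is a minimal illustration, not the asker's final code: the helper name first_match is hypothetical, returning None when nothing matches is an assumption, and re.escape is added in case a list item contains regex metacharacters.

```python
import re

listformatch = ['c', 'd', 'f', 'b']
# Escape each item so characters such as '.' or '+' are matched literally.
pattern = '|'.join(re.escape(item) for item in listformatch)

def first_match(text):
    """Return the first substring of text matching the pattern, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match('a,b'))  # the matched item, not the whole pattern
print(first_match('a'))    # no item from the list is present
```

Applied to a column, this would look like df['col2'].map(first_match), assuming the column holds strings.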

Here is my final function, but I still get the warning UserWarning: This pattern has match groups. To actually get the groups, use str.extract. I would like to fix this:

import re
import pandas as pd

def matching_func(fin, fin1):
    # fin is a CSV path, fin1 an Excel path; 'col1' is assumed to be the column name
    file1 = pd.read_csv(fin)
    file2 = pd.read_excel(fin1, 0, skiprows=1)
    pattern = '|'.join(file1['col1'].tolist())
    file2['new_col'] = file2['col1'].map(lambda x: re.search(pattern, x).group()
                                         if re.search(pattern, x) else None)
    return file2
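The warning quoted above is typically emitted by str.contains when the pattern contains capturing parentheses, since contains only reports True or False and ignores the groups. A minimal sketch, assuming the same single-letter items as earlier in the question: writing the group as non-capturing, (?:...), keeps the grouping without triggering the warning.

```python
import warnings
import pandas as pd

s = pd.Series(['a', 'a,b'])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # (?:...) groups the alternatives without creating a match group.
    mask = s.str.contains('(?:c|d|f|b)')

group_warnings = [w for w in caught if 'match groups' in str(w.message)]
print(mask.tolist())
print(len(group_warnings))
```

If the matched text itself is needed rather than a boolean mask, str.extract with a capturing group is the method the warning points to.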

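I think I now understand how pandas extract works, but I am probably still a bit rusty with regular expressions. How do I create a pattern variable for the following example: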

df[col1].str.extract('(word1|word2)')

I want to use a variable in place of the words in the argument, e.g. pattern = 'word1|word2', but that does not work because the string is not built the right way.

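This is the version I finally wanted, using the vectorized string methods in pandas 0.13:

Using the values of one column to extract from a second column: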

df[col1].str.extract('({})'.format('|'.join(df[col2])))
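As a runnable illustration of the one-liner above, here is a small sketch with made-up data. The sample values, the new_col name, and the use of re.escape are assumptions for the example. It also assumes a pandas version in which str.extract accepts the expand parameter; expand=False returns a Series when the pattern has a single group.

```python
import re
import pandas as pd

# Hypothetical sample data: col1 holds comma-separated text, col2 the values to look for.
df = pd.DataFrame({'col1': ['a,b', 'a', 'x,d'],
                   'col2': ['c', 'd', 'b']})

# Wrap the alternation in parentheses: extract needs a capture group.
pattern = '({})'.format('|'.join(re.escape(v) for v in df['col2']))

# expand=False yields a Series with the first match per row, NaN where nothing matches.
df['new_col'] = df['col1'].str.extract(pattern, expand=False)
print(pattern)
print(df)
```

A plain 'word1|word2' string fails with extract because it has no capture group, which is why the parentheses are added around the joined values.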


1 Answer

You probably want to use extract, or one of the other vectorized string methods:

In [11]: s = pd.Series(['a', 'a,b'])

In [12]: s.str.extract('([cdfb])')
Out[12]:
0    NaN
1      b
dtype: object
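Note that the character class [cdfb] only works because every item is a single character, and that extract returns just the first match. If a row can contain several of the listed values, str.findall is one of the other vectorized methods that may fit. The following is a sketch with an extra made-up row, not part of the original answer:

```python
import pandas as pd

s = pd.Series(['a', 'a,b', 'c,b,z'])

# findall returns a list of every non-overlapping match in each string.
all_matches = s.str.findall('[cdfb]')
print(all_matches.tolist())
```

For multi-character words, an alternation such as 'word1|word2' would replace the character class.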

Answered 2025-04-18 by Python大师
