Searching for a matching string pattern in a Python pandas DataFrame column

Posted 2024-06-01 01:25:40


I have a DataFrame like the following:

 name         genre
 satya      |ACTION|DRAMA|IC|
 satya      |COMEDY|BIOPIC|SOCIAL|
 abc        |CLASSICAL|
 xyz        |ROMANCE|ACTION|DARMA|
 def        |DISCOVERY|SPORT|COMEDY|IC|
 ghj        |IC|

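Now I want to query the DataFrame so that I get rows 1, 5 and 6; that is, I want to find |IC| whether it appears on its own or in any combination with other genres.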

So far, I can use an exact match

df[df['genre'] == '|ACTION|DRAMA|IC|']  # exact value, yields row 1

or a substring search with str.contains

df[df['genre'].str.contains('IC')]  # yields rows 1, 2, 3, 5, 6
# because BIOPIC and CLASSICAL also contain "IC"

But I don't want either of these.

# df[df['genre'].str.contains('|IC|')]  # row 6
# This does not meet my need either, as I am missing rows 1 and 5

So my requirement is to find the genres that contain |IC| (my string search fails because Python treats | as the OR operator).

Can someone suggest a regex or any other method to do this?


2 answers

Possibly a structure like this:

    df[df['columnName'].str.contains(re.compile('regex_pattern'))]

I think you can add \ to the regex to escape the pipe, because | without \ is interpreted as OR. From the Python re documentation:

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
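The two escaping options quoted above, plus `re.escape` from the standard library, can be compared with a minimal sketch using only the `re` module. The `genres` list and the `matching` helper are illustrative names, not part of the original question.

```python
import re

# Three ways to write a pattern that matches the literal text "|IC|".
escaped = r"\|IC\|"               # backslash-escape each pipe
char_class = r"[|]IC[|]"          # put each pipe inside a character class
auto_escaped = re.escape("|IC|")  # let the re module add the escapes

# Sample values taken from the question's genre column (illustrative subset).
genres = ["|ACTION|DRAMA|IC|", "|COMEDY|BIOPIC|SOCIAL|", "|CLASSICAL|", "|IC|"]


def matching(pattern):
    """Return the sample strings in which the pattern is found."""
    return [g for g in genres if re.search(pattern, g)]


results = {p: matching(p) for p in (escaped, char_class, auto_escaped)}
for pattern, hits in results.items():
    print(pattern, hits)
```

All three patterns select the same strings here, so the choice between them is mostly a matter of readability.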

print(df['genre'].str.contains(r'\|IC\|'))
0     True
1    False
2    False
3    False
4     True
5     True
Name: genre, dtype: bool

print(df[df['genre'].str.contains(r'\|IC\|')])
    name                        genre
0  satya            |ACTION|DRAMA|IC|
4    def  |DISCOVERY|SPORT|COMEDY|IC|
5    ghj                         |IC|
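As an alternative to escaping, `str.contains` accepts `regex=False`, which treats the pattern as a plain substring so the pipes need no escaping. The following is a self-contained sketch that rebuilds the question's data by hand; the variable names `mask` and `result` are illustrative.

```python
import pandas as pd

# Rebuild the DataFrame from the question (values copied as posted).
df = pd.DataFrame({
    "name": ["satya", "satya", "abc", "xyz", "def", "ghj"],
    "genre": [
        "|ACTION|DRAMA|IC|",
        "|COMEDY|BIOPIC|SOCIAL|",
        "|CLASSICAL|",
        "|ROMANCE|ACTION|DARMA|",
        "|DISCOVERY|SPORT|COMEDY|IC|",
        "|IC|",
    ],
})

# regex=False makes "|IC|" a literal substring rather than a regex alternation.
mask = df["genre"].str.contains("|IC|", regex=False)
result = df[mask]
print(result)
```

This selects the same rows as the escaped regex, labelled 0, 4 and 5 in the zero-based index (rows 1, 5 and 6 in the question's counting).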
