If a row value contains an item from a list as a substring, save that row value to a different DataFrame

Posted 2024-04-20 07:30:56

Input DataFrame:

index    link
1      https://zeewhois.com/en/
2      https://www.phpfk.de/domain
3      https://www.phpfk.de/de/domain
4      https://laseguridad.online/questions/1040/pued

list=['verizon','zeewhois','idad']

If df['link'] contains any item of list as a substring, that particular link should be placed in a separate, new DataFrame.

So far I have preprocessed the link column and brought it into the following format:

index    link
1      httpszeewhoiscomenwww
2      httpswwwphpfkdedomain
3      httpswwwphpfkdededomain
4      httpslaseguridadonlinequestions1040pued

To find which row values contain an item of list as a substring, I tried:

df["TRUEFALSE"] = df['link'].apply(lambda x: 1 if any(i in x for i in list) else 0)

But I got an error:

TypeError: 'in <string>' requires string as left operand, not float

2 answers

You can use str.contains:

list_strings =['verizon','zeewhois','idad']

df.loc[df.link.str.contains('|'.join(list_strings),case=False), 'TRUE_FALSE'] = True



 index             link                                TRUE_FALSE
    1   https://zeewhois.com/en/                        True
    2   https://www.phpfk.de/domain                     NaN
    3   https://www.phpfk.de/de/domain                  NaN
    4   https://laseguridad.online/questions/1040/pued  True

Then just filter on True to get the new DataFrame:

new_df = df.loc[df.TRUE_FALSE == True].copy()

index               link                        TRUE_FALSE
1   https://zeewhois.com/en/                        True
4   https://laseguridad.online/questions/1040/pued  True
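The flag column and the filter above can also be collapsed into a single boolean mask. This is only a minimal sketch under a few assumptions that are not in the original post: the search terms should be matched literally (hence re.escape), missing links should count as "no match" (hence na=False), and the extra None row is made up purely to show that case.

```python
import re

import pandas as pd

# Sample data modelled on the question. The final None row is an
# assumption, added only to show how a missing link is handled.
df = pd.DataFrame(
    {
        "link": [
            "https://zeewhois.com/en/",
            "https://www.phpfk.de/domain",
            "https://www.phpfk.de/de/domain",
            "https://laseguridad.online/questions/1040/pued",
            None,
        ]
    },
    index=pd.Index([1, 2, 3, 4, 5], name="index"),
)

list_strings = ["verizon", "zeewhois", "idad"]

# Escape each term so regex metacharacters such as "." are matched
# literally, then join the terms into one alternation pattern.
pattern = "|".join(re.escape(s) for s in list_strings)

# na=False turns missing links into False instead of NaN, so the
# result can be used directly as a boolean mask.
mask = df["link"].str.contains(pattern, case=False, na=False)

# Select the matching rows into a separate DataFrame.
new_df = df.loc[mask].copy()
print(new_df)
```

With this variant the non-matching rows never receive a NaN flag, because no helper column is written to df at all.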

You don't need to preprocess link. You can simply do the following:

In [51]: import numpy as np

In [47]: df                                                                                                                                                                                                 
Out[47]: 
                                                 link
index                                                
1                            https://zeewhois.com/en/
2                         https://www.phpfk.de/domain
3                      https://www.phpfk.de/de/domain
4      https://laseguridad.online/questions/1040/pued

l = ['verizon','zeewhois','idad']  # avoid naming variables after built-ins such as list or dict

In [50]: def match(x): 
    ...:     for i in l: 
    ...:         if i.lower() in x.lower(): 
    ...:             return i 
    ...:     return np.nan 
    ...:                     

In [48]: new_df = df[df['link'].apply(match).notna()] 

In [49]: new_df                                                                                                                                                                                             
Out[49]: 
                                                 link
index                                                
1                            https://zeewhois.com/en/
4      https://laseguridad.online/questions/1040/pued
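A note on the TypeError quoted in the question: Python raises "'in <string>' requires string as left operand, not float" when the value on the left of `in` is a float and the value on the right is a string. In the question's lambda the left operand is the list item, so the real search list probably contained a non-string entry such as NaN. The sketch below is only an illustration of that reading; the NaN entry and the names terms and clean_terms are assumptions, not taken from the original post.

```python
import pandas as pd

# Hypothetical search list: the NaN entry is an assumption, added to
# reproduce the error message quoted in the question.
terms = ["verizon", "zeewhois", float("nan"), "idad"]

df = pd.DataFrame(
    {
        "link": [
            "https://zeewhois.com/en/",
            "https://www.phpfk.de/domain",
            "https://www.phpfk.de/de/domain",
            "https://laseguridad.online/questions/1040/pued",
        ]
    },
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# Reproduce the error: on a row where no earlier term matches, the
# float entry ends up as the left operand of `in`.
error_message = None
try:
    df["link"].apply(lambda x: 1 if any(i in x for i in terms) else 0)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Keep only the string terms, then apply the same test as the question.
clean_terms = [t for t in terms if isinstance(t, str)]
df["TRUEFALSE"] = df["link"].apply(
    lambda x: 1 if any(t in x for t in clean_terms) else 0
)
print(df)
```

If the list is built from another DataFrame column, dropping missing values from that column before converting it to a list would have the same effect.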
