How can I use pandas.replace() with a list of regexes while honoring the list order?

3 Answers

User

Answer 1 · edited 2024-04-19 11:35:18

Here is one way to do it, using a nested list comprehension and the re.sub() function:

import re
import pandas as pd

A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
                                 '(.*)v10.vortex-win.data.microsoft.com',
                                 '(.*)i.microsoft.com', '(.*)microsoft.com'],
                  'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
                             re.compile('^(.*)v10.vortex-win.data.microsoft.com$'), 
                             re.compile('^(.*)i.microsoft.com$'), 
                             re.compile('^(.*)microsoft.com$')]})

B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
                                       'www.microsoft.com']})
# For each server_hostname, try every regex and keep the longest resulting wildcard
# (None when no regex matches, instead of max() raising on an empty list)
B['wildcards'] = [max([re.sub(to_replace, value, x) for to_replace, value
                       in A[['regex', 'wildcards']].values
                       if re.sub(to_replace, value, x) != x], key=len, default=None)
                  for x in B['server_hostname']]

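If list order rather than match length should decide the winner, a plain "first match wins" loop is one option. The following is only a sketch: the helper name `first_match`, the extra `example.org` row, and the choice to escape the literal host part with `re.escape` are assumptions for illustration, not part of the answer above.

```python
import re
import pandas as pd

# Patterns listed from most to least specific; list order decides the winner.
wildcards = ['(.*)activation.playready.microsoft.com',
             '(.*)v10.vortex-win.data.microsoft.com',
             '(.*)i.microsoft.com',
             '(.*)microsoft.com']

# Escape the literal host part so "." only matches a dot (a deliberate tightening).
prefix = '(.*)'
compiled = [(w, re.compile('^.*' + re.escape(w[len(prefix):]) + '$'))
            for w in wildcards]

def first_match(hostname):
    """Return the first wildcard (in list order) whose regex matches, else None."""
    return next((w for w, rx in compiled if rx.match(hostname)), None)

B = pd.DataFrame({'server_hostname': ['v10.vortex-win.data.microsoft.com',
                                      'www.microsoft.com',
                                      'www.i.microsoft.com',
                                      'example.org']})
B['wildcards'] = [first_match(h) for h in B['server_hostname']]
print(B)
```

This still loops in Python, so it is not expected to be faster than the list comprehension; it only makes the ordering rule explicit.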

User

Answer 2 · edited 2024-04-19 11:35:18

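Most of the answers use apply(), which is slower than a built-in vectorized solution. I was hoping to use .replace() because it is a built-in vectorized function and should therefore be fast. @vlemaistre's answer was the only one that avoids .apply(), as does my solution. Instead of compiling each wildcard into a regex, it treats the wildcard as a right-hand substring and applies the logic: "if server_hostname ends with wildcard, it is a match". As long as the wildcards are sorted by length (longest first, since the first wildcard to match a row keeps it), this works fine.

My function is: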

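def match_to_whitelist(accepts_df, whitelist_df):
    """Add a `wildcards` column to accepts_df showing which whitelist entry (if any) each row matches."""
    accepts_df.loc[:, 'wildcards'] = None
    # whitelist_df is assumed to be sorted so the first matching wildcard is the desired one
    for wildcard in whitelist_df['wildcards']:
        # Only fill rows that have not been matched by an earlier wildcard
        accepts_df.loc[(accepts_df['wildcards'].isnull()) & (
            accepts_df['server_hostname'].str.endswith(wildcard)), 'wildcards'] = wildcard
    # Count matched rows: sum the boolean mask (len() would count every row)
    rows_matched = accepts_df['wildcards'].notnull().sum()
    print(f"matched {rows_matched}")
    return accepts_df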

Here, accepts_df is like B from before and whitelist_df is like A from before, with two differences:

There is no regex column
The wildcards values are no longer in glob/regex format (i.e. "(.*)microsoft.com" becomes "microsoft.com")
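As a minimal usage sketch of the suffix-matching idea (the sample rows and the helper name `match_suffixes` are made up for illustration and are not from the answer): the wildcards are stored as plain suffixes and sorted longest first, so the most specific entry claims a row before a shorter one can.

```python
import pandas as pd

# Hypothetical sample data: plain suffixes, no "(.*)" prefix and no regex column.
whitelist_df = pd.DataFrame({'wildcards': ['microsoft.com',
                                           'i.microsoft.com',
                                           'v10.vortex-win.data.microsoft.com']})
accepts_df = pd.DataFrame({'server_hostname': ['v10.vortex-win.data.microsoft.com',
                                               'www.microsoft.com',
                                               'www.i.microsoft.com',
                                               'example.org']})

def match_suffixes(accepts_df, whitelist_df):
    """Illustrative helper: label each hostname with the longest suffix it ends with."""
    out = accepts_df.copy()
    out['wildcards'] = None
    # Longest suffix first, so a specific entry wins over a more general one.
    ordered = sorted(whitelist_df['wildcards'], key=len, reverse=True)
    for wildcard in ordered:
        unmatched = out['wildcards'].isnull()
        hit = out['server_hostname'].str.endswith(wildcard)
        out.loc[unmatched & hit, 'wildcards'] = wildcard
    return out

result = match_suffixes(accepts_df, whitelist_df)
print(result)
```

Note that a bare suffix test does not respect label boundaries: a hostname such as "notmicrosoft.com" would also end with "microsoft.com". Storing the suffix with a leading dot is one way to tighten that, at the cost of no longer matching the bare domain.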

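To benchmark the answers on my machine, I use my own solution as the baseline: it takes 27 seconds to process 100k accepts_df rows against 400 whitelist_df rows. With the same dataset, here are the timings for the other solutions (I was lazy: if something did not run out of the gate, I did not debug it much to find out why):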

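@vlemaistre - list comprehension with vectorized functions: 193 seconds
@user214 - SequenceMatcher: 234 seconds
@aws_apprentice - compare search result lengths: 24 seconds
@fpersyn - first match (the best match if A is sorted): over 6 minutes, so I quit
@andyhayden - lastgroup: not tested, because I could not (quickly) build a long regex programmatically
@capelastegui - Series.str.match(): error: "pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects"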

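Ultimately, none of our answers shows how to use .replace() as desired. For now I will leave this question open for a few weeks in case someone can provide an answer that makes better use of .replace(), or at least some other fast vectorized solution. Until then, I will stick with what I have, or perhaps use aws_apprentice's once I have validated its results.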
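One possible vectorized direction, sketched here under assumptions and not benchmarked against the timings above: join the escaped suffixes into a single alternation anchored at the end of the string and let `Series.str.extract` do the matching. Because the regex engine returns the leftmost match and every alternative has to reach the end of the string, the match it finds is the longest matching suffix. The sample data and variable names are illustrative, and this relies on `.str.extract` rather than `.replace()`, so it does not settle the `.replace()` question itself.

```python
import re
import pandas as pd

# Plain suffixes (hypothetical sample data, without the "(.*)" prefix).
suffixes = ['activation.playready.microsoft.com',
            'v10.vortex-win.data.microsoft.com',
            'i.microsoft.com',
            'microsoft.com']
B = pd.DataFrame({'server_hostname': ['v10.vortex-win.data.microsoft.com',
                                      'www.microsoft.com',
                                      'www.i.microsoft.com',
                                      'example.org']})

# re.escape makes the dots literal; the single capture group is what extract returns.
pattern = '(' + '|'.join(re.escape(s) for s in suffixes) + ')$'

# Rows with no matching suffix come back as missing values.
B['wildcards'] = B['server_hostname'].str.extract(pattern, expand=False)
print(B)
```

With hundreds of whitelist entries the alternation gets long, so whether this beats the endswith loop on a real dataset would need to be measured.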

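EDIT: I improved the matcher by adding a "domain" column to both DFs, derived from each wildcard/server_hostname (e.g. www.microsoft.com becomes "microsoft.com"). I then use groupby('domain') on both DFs, iterate over the whitelist groups by domain, fetch the group for the same domain from the server_hostname DF (B), and do the matching using only the subset of wildcards/server_hostnames in each group. This cut my processing time in half.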
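The grouping idea could look roughly like the sketch below. The original code for this edit is not shown, so everything here is an assumption: the helper `add_domain`, the rule "domain = last two dot-separated labels" (which would be wrong for suffixes such as co.uk), and the sample rows.

```python
import pandas as pd

def add_domain(df, column):
    """Illustrative helper: keep the last two dot-separated labels as 'domain'."""
    out = df.copy()
    out['domain'] = out[column].str.split('.').str[-2:].str.join('.')
    return out

whitelist_df = add_domain(pd.DataFrame({'wildcards': [
    'v10.vortex-win.data.microsoft.com', 'i.microsoft.com',
    'microsoft.com', 'cdn.example.org']}), 'wildcards')
accepts_df = add_domain(pd.DataFrame({'server_hostname': [
    'v10.vortex-win.data.microsoft.com', 'www.microsoft.com',
    'www.i.microsoft.com', 'img.cdn.example.org', 'unlisted.net']}), 'server_hostname')

accepts_df['wildcards'] = None
hosts_by_domain = accepts_df.groupby('domain').groups  # domain -> row labels

for domain, wl_group in whitelist_df.groupby('domain'):
    if domain not in hosts_by_domain:
        continue  # no hostnames share this domain, nothing to compare
    rows = hosts_by_domain[domain]
    # Only compare hostnames and wildcards that share the same domain.
    for wildcard in sorted(wl_group['wildcards'], key=len, reverse=True):
        hosts = accepts_df.loc[rows, 'server_hostname']
        unmatched = accepts_df.loc[rows, 'wildcards'].isnull()
        hit = hosts.str.endswith(wildcard) & unmatched
        accepts_df.loc[hit[hit].index, 'wildcards'] = wildcard

print(accepts_df)
```

The saving comes from each wildcard being tested only against hostnames in its own domain group instead of against every row.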

User

Answer 3 · edited 2024-04-19 11:35:18

Another approach is to use SequenceMatcher together with re.match.

Taking the data from @vlemaistre's answer:

from difflib import SequenceMatcher
import pandas as pd
import re

A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
                                 '(.*)v10.vortex-win.data.microsoft.com',
                                 '(.*)i.microsoft.com', '(.*)microsoft.com'],
                  'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
                             re.compile('^(.*)v10.vortex-win.data.microsoft.com$'), 
                             re.compile('^(.*)i.microsoft.com$'), 
                             re.compile('^(.*)microsoft.com$')]})

B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
                                       'www.microsoft.com', 'www.i.microsoft.com']})

def regex_match(x):
    match = None
    ratio = 0
    for w, r in A[['wildcards', 'regex']].to_numpy():
        if re.match(r, x) is not None:
            pct = SequenceMatcher(None, w, x).ratio()
            # Keep the wildcard with the highest similarity ratio seen so far
            if ratio < pct:
                ratio, match = pct, w
    return match

B['wildcards'] = B.server_hostname.apply(regex_match)

# print(B.wildcards)
0    (.*)v10.vortex-win.data.microsoft.com
1                        (.*)microsoft.com
2                      (.*)i.microsoft.com
Name: wildcards, dtype: object
