Find partial matches of a word in a URL column, directly after https://

Posted 2024-05-29 05:20:04


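Basically, I have a DataFrame in which one column is a list of names and the other holds the URLs associated with each name in some way (example df):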

   Name                    Domain
'Apple Inc'             'https://mapquest.com/askjdnas387y1/apple-inc', 'https://linkedin.com/apple-inc/askjdnas387y1/', 'https://www.apple-inc.com/asdkjsad542/'     
'Aperture Industries'   'https://www.cakewasdelicious.com/aperture/run-away/', 'https://aperture-incorporated.com/aperture/', 'https://www.buzzfeed.com/aperture/the-top-ten-most-evil-companies=will-shock-you/'
'Umbrella Corp'         'https://www.umbrella-corp.org/were-not-evil/', 'https://umbrella.org/experiment-death/', 'https://www.most-evil.org/umbrella-corps/'

I am trying to find the URLs that contain the keyword, or at least a partial match for it, directly after either of the following:

'https://NAME.whateverthispartdoesntmatter'  # ...or...
'https://www.NAME.whateverthispartdoesntmatter' # <- not a real link

Right now I am using the fuzzywuzzy package to get partial matches:

fuzz.token_set_ratio(name, value)

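It works very well for partial matches, but the match is not position-dependent, so I get a perfect keyword match that sits somewhere in the middle of the URL rather than where I need it: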

https://www.bloomberg.com/profiles/companies/aperture-inc/0117091D 

1 answer
User
#1 · Posted 2024-05-29 05:20:04

Use explode/unnest string, str.extract and fuzzywuzzy

First, we use the explode_str function (listed at the end of this answer) to unnest your strings into rows:

df = explode_str(df, 'Domain', ',').reset_index(drop=True)

Then we use a regular expression to match the two patterns, with or without www, and extract the name from them:

# Raw string, escaped dot after www, and lazy groups so each capture stops at the first dot
m = df['Domain'].str.extract(r'https://www\.(.*?)\.|https://(.*?)\.')
df['M'] = m[0].fillna(m[1])
print(df)


                  Name                                             Domain                      M
0            Apple Inc       https://mapquest.com/askjdnas387y1/apple-inc               mapquest
1            Apple Inc      https://linkedin.com/apple-inc/askjdnas387y1/               linkedin
2            Apple Inc             https://www.apple-inc.com/asdkjsad542/              apple-inc
3  Aperture Industries  https://www.cakewasdelicious.com/aperture/run-...       cakewasdelicious
4  Aperture Industries        https://aperture-incorporated.com/aperture/  aperture-incorporated
5  Aperture Industries   https://www.buzzfeed.com/aperture/the-top-ten...               buzzfeed
6        Umbrella Corp       https://www.umbrella-corp.org/were-not-evil/          umbrella-corp
7        Umbrella Corp             https://umbrella.org/experiment-death/               umbrella
8        Umbrella Corp          https://www.most-evil.org/umbrella-corps/              most-evil
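
As an alternative to the regular expression, the same "first label after https:// or https://www." idea can be sketched with the standard library. This is only a sketch under assumptions: `first_host_label` is a hypothetical helper name, and stripping stray quotes and spaces from each URL is an assumption about how the pieces look after splitting.

```python
from urllib.parse import urlparse

def first_host_label(url):
    """Return the first hostname label, ignoring an optional leading 'www.'."""
    # Assumption: split pieces may carry leftover spaces or quotes.
    host = urlparse(url.strip(" '\"")).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]

urls = [
    "https://mapquest.com/askjdnas387y1/apple-inc",
    "https://www.apple-inc.com/asdkjsad542/",
    "https://www.bloomberg.com/profiles/companies/aperture-inc/0117091D",
]
labels = [first_host_label(u) for u in urls]
print(labels)
```

Because only the hostname is inspected, a keyword that appears later in the path (as in the bloomberg URL from the question) never reaches the fuzzy-matching step.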

Then we use fuzzywuzzy to keep only the rows whose match score is above 80:

from fuzzywuzzy import fuzz

m2 = df.apply(lambda x: fuzz.token_sort_ratio(x['Name'], x['M']), axis=1)

df[m2>80]


            Name                                        Domain              M
2      Apple Inc        https://www.apple-inc.com/asdkjsad542/      apple-inc
6  Umbrella Corp  https://www.umbrella-corp.org/were-not-evil/  umbrella-corp

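Note that I used token_sort_ratio instead of token_set_ratio to catch the difference between umbrella and umbrella-corp. token_set_ratio gives a perfect score whenever the tokens of one string are a subset of the other's, so it would also accept umbrella on its own.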
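
To see why the two scorers behave differently, here is a rough standard-library approximation of both. It is not fuzzywuzzy's actual implementation: `token_sort_sketch` and `token_set_sketch` are hypothetical names, the tokenisation is simplified, and `difflib.SequenceMatcher` stands in for the library's ratio, so real scores may differ slightly.

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def ratio(a, b):
    # Similarity on a 0-100 scale, rounded to an integer.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_sketch(a, b):
    # Compare the two strings after sorting their tokens alphabetically.
    return ratio(" ".join(sorted(tokens(a))), " ".join(sorted(tokens(b))))

def token_set_sketch(a, b):
    # Compare the shared tokens against each side's full token set.
    ta, tb = set(tokens(a)), set(tokens(b))
    common = " ".join(sorted(ta & tb))
    left = (common + " " + " ".join(sorted(ta - tb))).strip()
    right = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, left), ratio(common, right), ratio(left, right))

print(token_sort_sketch("Umbrella Corp", "umbrella"))       # 76
print(token_set_sketch("Umbrella Corp", "umbrella"))        # 100
print(token_sort_sketch("Umbrella Corp", "umbrella-corp"))  # 100
```

With a threshold of 80, the sort-based score rejects the bare umbrella label while the set-based score would let it through.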


The function used, taken from the linked answer:

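import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row position once per separated piece in its string.
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Re-join and re-split so that every piece lands on its own row.
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})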
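
The row-expansion idea behind explode_str can also be illustrated in plain Python on a list of dicts. This is a minimal sketch, not part of the linked answer: `explode_rows` is a hypothetical name, and stripping leftover spaces and quotes from each piece is an extra step that explode_str itself does not perform.

```python
def explode_rows(rows, col, sep):
    """Return one copy of each row per separated piece of row[col]."""
    out = []
    for row in rows:
        for piece in row[col].split(sep):
            # Assumption: trim spaces and quotes left over from the split.
            out.append({**row, col: piece.strip(" '\"")})
    return out

rows = [{
    "Name": "Umbrella Corp",
    "Domain": "https://www.umbrella-corp.org/were-not-evil/, "
              "https://umbrella.org/experiment-death/",
}]
exploded = explode_rows(rows, "Domain", ",")
for row in exploded:
    print(row["Name"], row["Domain"])
```

Each output row keeps the original Name and carries exactly one URL, which is the shape the extraction and scoring steps expect.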
