Regular expression to extract all URLs from a dict-like string

Posted 2024-04-28 04:45:05


This is the string I want to extract the URLs from:

s = "'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}"

This is the code I have tried so far:

urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', s)

But it only prints this URL, repeated:

    ['https://www.riteaid.com']

2 Answers

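If you have to use a regex for this example to match {url:''}, you can use a positive lookbehind (?<= and a positive lookahead (?=, with the negated character class [^']+ between them to match the URL as one or more characters that are not '.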
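
The pattern for this answer did not survive on this page, so the following is only a sketch of what the description above implies, not necessarily the answerer's exact regex. It assumes the string `s` from the question, and the braces are escaped to make it explicit that they are literals.

```python
import re

s = ("'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},"
     "'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},"
     "'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},"
     "'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}")

# (?<=\{url:')  positive lookbehind: the match must be preceded by {url:'
# [^']+         negated character class: one or more characters that are not '
# (?='\})       positive lookahead: the match must be followed by '}
urls = re.findall(r"(?<=\{url:')[^']+(?='\})", s)
print(urls)
```

Because the lookarounds consume nothing, `re.findall` returns the bare URLs without needing a capture group.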


You could also be less restrictive about the example data and leave out the leading { and the trailing }.
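
As a rough illustration of that looser variant (again a reconstruction under assumptions, since the original pattern is missing here), the sketch below compares both versions. The input `sample` is hypothetical and uses made-up example.com URLs with extra spaces inside the braces of the first entry.

```python
import re

# Hypothetical input: the first entry has spaces inside the braces.
sample = "'1':{ url:'https://example.com/a' },'2':{url:'https://example.com/b'}"

# Strict variant: requires {url:' directly before and '} directly after.
strict = re.findall(r"(?<=\{url:')[^']+(?='\})", sample)

# Looser variant: only requires url:' before and a closing ' after.
loose = re.findall(r"(?<=url:')[^']+(?=')", sample)

print(strict)  # ['https://example.com/b']
print(loose)   # ['https://example.com/a', 'https://example.com/b']
```

The looser pattern tolerates small formatting differences around the braces, at the cost of also matching any other `url:'...'` fragment that happens to appear in the text.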


Since you mention a dict-like string and need a regex for your specific case, this one works:

s = "'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}"

import re

urls = re.findall(r"url:'(https?://.*?)'}", s)

result:
['https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442',
 'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009',
 'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249',
 'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568']

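Explanation

url:'(http : literal text; the ( opens the capture group

s? : the optional literal character "s"

:// : literal text

.*? : any characters, matched non-greedily

)'} : the ) closes the capture group, followed by the literal text '}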
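
The same breakdown can be written into the pattern itself with `re.VERBOSE`, which ignores whitespace in the pattern and allows comments. This is just an annotated sketch of the regex above; the name `pattern` is chosen here for illustration, and `s` is shortened to the first two entries of the original string.

```python
import re

# Shortened copy of the question's string (first two entries only).
s = ("'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},"
     "'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},}")

pattern = re.compile(r"""
    url:'          # literal text that precedes every URL
    (              # open the capture group (this is what findall returns)
      https?://    # http, an optional s, then ://
      .*?          # any characters, as few as possible (non-greedy)
    )              # close the capture group
    '}             # literal text that ends every entry
""", re.VERBOSE)

urls = pattern.findall(s)
print(urls)
```

The non-greedy `.*?` matters here: a greedy `.*` would run to the last `'}` in the string and return everything in between as a single match.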
