How do I extract parameters from a URL?


url = 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/'
url2 = 'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
new = url.split("/")[-4:]
new2 = url2.split("/")[-2:]
print(new)
print(new2)

Output : ['world-cuisine', 'asian', 'chinese', ''] 
         ['soups-stews-and-chili', '']
  • The output I need is ['world-cuisine', 'asian', 'chinese'] & ['soups-stews-and-chili']
  • The URLs have different numbers of path segments, so I can't handle all of them with a fixed index; I only want to extract the main parameters that come after the numeric id
  • Also, the trailing '/' at the end of the URL is required: in Scrapy, when I use a URL without the '/', it throws a 301 error. But as you can see from the output, that trailing slash leaves an extra empty string '' that I can't simply ignore
  • What can I do to extract the parameters from these various URLs?

Some other example URLs:

"https://www.allrecipes.com/recipes/416/seafood/fish/salmon/"

"https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/"

  • How do we write a rule to follow pagination on URLs like https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2 ?

    Rule(LinkExtractor(allow=(r'recipes/?page=\d+')), follow=True)

I'm new to scrapy and regex, so I'd really appreciate any help with this.


3 Answers

I'm not 100% sure I understood your question correctly, but I think the code below does what you need.

Edit
Code updated after the exchange in the comments.

urls = [
    'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
    'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
    'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
    'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
    'https://www.allrecipes.com/recipes/qqqq/94/soups-stews-and-chili/x/y/z/q'
]

for url in urls:
    for index, part in enumerate(url.split('/')):
        if part.isnumeric():
            # the category names start right after the first numeric segment
            start = index + 1
            break
    # [:-1] drops the empty string produced by the trailing '/'
    print(url.split('/')[start:-1])

Output

['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['soups-stews-and-chili', 'x', 'y', 'z']
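
A side note that is not part of the original answer: if one of the URLs carries a query string such as ?page=2, splitting the raw string can leak the query into the segments. Stripping the query first with the standard library's urllib.parse keeps the same idea working; a minimal sketch (category_parts is just a hypothetical helper name):

from urllib.parse import urlparse

def category_parts(url):
    # Keep only the path, so a query string like '?page=2' is discarded,
    # and filter out the empty strings left by the leading and trailing '/'
    parts = [p for p in urlparse(url).path.split('/') if p]
    for index, part in enumerate(parts):
        if part.isnumeric():
            # the categories are everything after the first numeric segment
            return parts[index + 1:]
    return []

print(category_parts('https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2'))
# ['world-cuisine', 'asian', 'chinese']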

Old answer

urls = [
    'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
    'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
    'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
    'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
]

for url in urls:
    # assumes the category names always start at path index 5;
    # [:-1] drops the empty string left by the trailing '/'
    print(url.split("/")[5:-1])

Output

['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']

Something like this. The idea is to find the "int" path element and take all the path elements to its right.

from collections import defaultdict
from typing import Dict, List

urls = ['https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
        'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/']


def is_int(param: str) -> bool:
    try:
        int(param)
        return True
    except ValueError:
        return False


data: Dict[str, List[str]] = defaultdict(list)
for url in urls:
    elements = url.split('/')
    elements.reverse()
    for element in elements:
        if len(element.strip()) < 1:
            continue  # skip the empty string left by the trailing '/'
        if not is_int(element):
            data[url].append(element)
        else:
            break  # stop once the numeric id is reached
print(data)

Output

defaultdict(<class 'list'>, {'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/': ['salmon', 'fish', 'seafood'], 'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/': ['pork', 'meat-and-poultry']})
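
One thing to note (my addition, not part of the original answer): because the segments are collected right-to-left, each list comes out reversed relative to the URL. If the original left-to-right order matters, a small post-processing step restores it:

# hypothetical post-processing: restore left-to-right order for each URL
ordered = {url: parts[::-1] for url, parts in data.items()}
print(ordered['https://www.allrecipes.com/recipes/416/seafood/fish/salmon/'])
# ['seafood', 'fish', 'seafood'... -> ['seafood', 'fish', 'salmon']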

You can combine the re module with str.split:

import re

urls = [
    "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/",
    "https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
    "https://www.allrecipes.com/recipes/416/seafood/fish/salmon/",
    "https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/",
]

# capture everything between the numeric id and the final '/'
r = re.compile(r"(?:\d+/)(.*)/")

for url in urls:
    print(r.search(url).group(1).split("/"))

Prints:

['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
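
On the pagination rule from the question (my addition, sketched rather than tested): in a regex the bare ? in recipes/?page=\d+ only makes the preceding / optional, so the pattern never matches the literal ? that starts the query string. Escaping it should help, assuming a standard Scrapy CrawlSpider; the spider name and start URL below are just placeholders:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RecipesSpider(CrawlSpider):
    name = "recipes"  # placeholder spider name
    allowed_domains = ["allrecipes.com"]
    start_urls = ["https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/"]

    rules = (
        # '\?' matches the literal '?' before the query string;
        # follow=True keeps the crawler walking through the paginated pages
        Rule(LinkExtractor(allow=(r"recipes/.*\?page=\d+",)), follow=True),
    )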
