Unable to scrape similar links from different depths of a webpage

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    items = [urljoin(link,item.get("href")) for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq")]
    return items

if __name__ == '__main__':
    for item in get_links(URL):
        print(item)

3 Answers

User

Answer 1 · Edited 2024-04-26 09:25:38

It appears that the "Top Experiences" and "More Experiences" links share the same class, so you can use .find_all to get the links.

import requests
#from urllib.parse import urljoin
from bs4 import BeautifulSoup

# URL to scrape
url = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

# Make request and Initialize BS4 with request content
req = requests.get(url)
soup = BeautifulSoup(req.content, "lxml")

# Elements with this class hold the "Top Experiences" and "More Experiences" links
links = soup.find_all(class_="_l8g1fr")

# Test code: print the text and href of the first link in each element
for link in links:
    anchor = link.find("a")
    if anchor is not None:  # skip elements that contain no <a> tag
        print(anchor.get_text())
        print(anchor.get('href'))

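Refactor the code to fit your own coding style.

User

Answer 2 · Edited 2024-04-26 09:25:38

Process:

Get all the Top Experiences links
Get all the More Experiences links
Send a request to each More Experiences link in turn and collect the links listed under Experiences on each page.

The links sit in the same div on every page, because all pages use the same class _12kw8n71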

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from time import sleep
from random import randint

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"
res = requests.get(URL)
soup = BeautifulSoup(res.text, "lxml")
# The first div with this class holds Top Experiences, the second More Experiences
sections = soup.find_all("div", class_="_12kw8n71")
top_experiences = [urljoin(URL, item.get("href")) for item in sections[0].find_all('a')]
more_experiences = [urljoin(URL, item.get("href")) for item in sections[1].find_all('a')]
generated_experiences = []
# Visit each link in more_experiences
for url in more_experiences:
    sleep(randint(1, 10))  # random delay to reduce the risk of being blocked
    # Fetch and parse the linked page itself instead of reusing the first page's soup
    page = BeautifulSoup(requests.get(url).text, "lxml")
    generated_experiences.extend([urljoin(url, item.get("href")) for item in page.find_all("div", class_="_12kw8n71")[0].find_all('a')])
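
The three steps listed above can also be sketched as a small function that keeps the fetching separate from the traversal, so the logic can be checked offline. This is only an illustration under stated assumptions: gather_experiences, fetch_sections and pause are hypothetical names that do not appear in the answer, and fetch_sections(url) is assumed to return one list of hrefs per div carrying the class _12kw8n71.

```python
from urllib.parse import urljoin


def gather_experiences(start_url, fetch_sections, pause=lambda: None):
    """Return (top, more, generated) lists of absolute URLs.

    fetch_sections is a hypothetical callable: given a URL, it returns a
    list of sections, each section being a list of href strings.
    pause is called before every follow-up request (e.g. to sleep).
    """
    sections = fetch_sections(start_url)
    # Step 1: links in the first section (assumed to be Top Experiences)
    top = [urljoin(start_url, h) for h in sections[0]] if sections else []
    # Step 2: links in the second section (assumed to be More Experiences)
    more = [urljoin(start_url, h) for h in sections[1]] if len(sections) > 1 else []
    # Step 3: visit every "more" page and collect its first section
    generated = []
    for url in more:
        pause()
        sub_sections = fetch_sections(url)
        if sub_sections:
            generated.extend(urljoin(url, h) for h in sub_sections[0])
    return top, more, generated
```

In real use, fetch_sections would wrap the requests.get and BeautifulSoup calls from the answer, and pause could be something like lambda: sleep(randint(1, 10)).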

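Notes:

The links you need end up in three lists: top_experiences, more_experiences and generated_experiences
I added a random delay to avoid being blocked.
Do not print the lists, because they are too long.
top_experiences: 50 links
more_experiences: 299 links
generated_experiences: 14950 links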
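
If the three lists turn out to share entries, they can be merged into one de-duplicated list. A minimal sketch, assuming overlap is possible (the answer does not say whether it is); merge_unique is a hypothetical helper name:

```python
def merge_unique(*link_lists):
    """Concatenate the given lists, dropping repeats but keeping first-seen order."""
    merged = {}
    for links in link_lists:
        # dict keys preserve insertion order, so earlier links stay first
        merged.update(dict.fromkeys(links))
    return list(merged)
```

For example, merge_unique(top_experiences, more_experiences, generated_experiences) would give a single list that can be written to a file instead of printed.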

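User

Answer 3 · Edited 2024-04-26 09:25:38

The solution is a bit tricky, and it can be done in several ways. The approach I found most useful is to feed the links under More Experiences back into get_links() recursively. All links under More Experiences share a common keyword, _pdp-.

So, when you add a conditional statement to the function that passes those links back through get_links() recursively, the else block produces the desired links. The key point is that all the required links carry the class _1f0v6pq, so the logic for getting the links is fairly simple.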

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq"):
        if "_pdp-" in item.get("href"):
            get_links(urljoin(URL,item.get("href")))
        else:
            print(urljoin(URL,item.get("href")))

if __name__ == '__main__':
    get_links(URL)
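
The recursion above revisits a page whenever two sitemap pages link to each other. A minimal sketch of the same idea with a visited set, yielding links instead of printing them; crawl and extract_hrefs are hypothetical names, and extract_hrefs(url) is assumed to return the href strings that the soup.select call would find on that page:

```python
from urllib.parse import urljoin


def crawl(url, extract_hrefs, marker="_pdp-", seen=None):
    """Yield leaf links reachable from url, following hrefs that contain marker."""
    if seen is None:
        seen = set()
    if url in seen:
        return  # already visited: avoid looping forever on cyclic links
    seen.add(url)
    for href in extract_hrefs(url):
        if not href:
            continue  # skip anchors without an href
        target = urljoin(url, href)
        if marker in href:
            # Sitemap page: descend one level, sharing the visited set
            yield from crawl(target, extract_hrefs, marker, seen)
        else:
            # Leaf link: this is one of the links we want
            yield target
```

Because crawl is a generator, the caller decides what to do with each link, for example list(crawl(URL, extract_hrefs)) or writing them to a file one by one.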
