How to iterate through pages on eBay

Posted 2024-04-25 08:18:19


I am building a scraper for eBay. I am trying to find a way to manipulate the page-number portion of the eBay URL to go to the next page until there are no more pages (if you are on page 2, the page-number portion looks like "_pgn=2"). I noticed that if you enter a page number greater than the maximum number of pages for a search, the page reloads as the last page instead of erroring out as if the page did not exist (if a search has 5 pages, the URL with _pgn=5 and the URL with _pgn=100 route to the same page).

How can I start on page 1, get the HTML soup of the page, pull all the relevant data I want from that soup, then load the next page with the new page number and start the process over until there are no new pages left to scrape?

I tried to get the number of results for a search with a Selenium XPath lookup and use math.ceil on the quotient of that count and 50 (the default maximum listings per page), intending to use the quotient as my max_pages, but I get an error saying the element does not exist even though it does: self.driver.find_element_by_xpath('xpath').text. That 243 (the results count shown on the page) is what I am trying to get with the XPath.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver


class EbayScraper(object):

    def __init__(self, item, buying_type):
        self.base_url = "https://www.ebay.com/sch/i.html?_nkw="
        self.driver = webdriver.Chrome(r"chromedriver.exe")
        self.item = item
        self.buying_type = buying_type + "=1"
        self.url_seperator = "&_sop=12&rt=nc&LH_"
        self.url_seperator2 = "&_pgn="
        self.page_num = "1"

    def getPageUrl(self):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"

        self.item = self.item.replace(" ", "+")

        url = self.base_url + self.item + self.url_seperator + self.buying_type + self.url_seperator2 + self.page_num
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        for listing in soup.find_all("li", {"class": "s-item"}):
            raw = listing.find_all("a", {"class": "s-item__link"})
            if raw:
                raw_price = listing.find_all("span", {"class": "s-item__price"})[0]
                raw_title = listing.find_all("h3", {"class": "s-item__title"})[0]
                raw_link = listing.find_all("a", {"class": "s-item__link"})[0]
                raw_condition = listing.find_all("span", {"class": "SECONDARY_INFO"})[0]
                condition = raw_condition.text
                price = float(raw_price.text[1:])
                title = raw_title.text
                link = raw_link['href']
                print(title)
                print(condition)
                print(price)
                if self.buying_type != "BIN=1":
                    raw_time_left = listing.find_all("span", {"class": "s-item__time-left"})[0]
                    time_left = raw_time_left.text[:-4]
                    print(time_left)
                print(link)
                print('\n')


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")

    instance = EbayScraper(item, buying_type)
    page = instance.getPageUrl()
    instance.getInfo(page)
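
For what it's worth, the max-pages idea from the question could be sketched as below. This is only a sketch: the XPath is a placeholder for whatever element actually holds the results count (eBay's real markup differs), and the explicit wait is a common fix when Selenium reports that an element which clearly exists "does not exist", because the page simply had not finished rendering when the lookup ran.

import math

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.ebay.com/sch/i.html?_nkw=example")

# Wait up to 10 seconds for the results-count element instead of reading
# it immediately; the XPath below is illustrative, not eBay's markup.
count_el = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//*[contains(@class, 'count')]"))
)
# e.g. "243 results" -> 243
result_count = int(count_el.text.split()[0].replace(",", ""))
max_pages = math.ceil(result_count / 50)  # 50 listings per page by default
driver.quit()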

1 Answer

Posted 2024-04-25 08:18:19

If you want to iterate through all pages and gather all the results, your script needs to check, after visiting each page, whether a next page exists.

import requests
from bs4 import BeautifulSoup


class EbayScraper(object):

    def __init__(self, item, buying_type):
        ...
        self.currentPage = 1

    def get_url(self, page=1):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"

        self.item = self.item.replace(" ", "+")
        # _ipg=200 asks eBay for 200 items per page
        return '{}{}{}{}{}{}&_ipg=200'.format(
            self.base_url, self.item, self.url_seperator, self.buying_type,
            self.url_seperator2, page
        )

    def page_has_next(self, soup):
        container = soup.find('ol', 'x-pagination__ol')
        currentPage = container.find('li', 'x-pagination__li selected')
        next_sibling = currentPage.next_sibling
        if next_sibling is None:
            # debugging aid: show the pagination markup when no next page is found
            print(container)
        return next_sibling is not None

    def iterate_page(self):
        # loop as long as there are more pages, otherwise stop
        while True:
            page = self.getPageUrl(self.currentPage)
            self.getInfo(page)
            if self.page_has_next(page) is False:
                break
            else:
                self.currentPage += 1

    def getPageUrl(self, pageNum):
        url = self.get_url(pageNum)
        print('page: ', url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        ...


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")

    instance = EbayScraper(item, buying_type)
    instance.iterate_page()

The important functions here are page_has_next and iterate_page:

  • page_has_next - a function that checks whether the pagination on the page has another li element after the selected one. For example, with pagination rendered as < 1 2 3 >, if we are on page 1 it checks whether a 2 comes next, and so on (a toy illustration follows this list).

  • iterate_page - loops until page_has_next reports that there is no next page.
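
As a toy illustration of the next_sibling check, here is the same logic run against hand-written markup that mirrors the answer's class names (not real eBay HTML):

from bs4 import BeautifulSoup

# Minimal pagination markup with no whitespace between the <li> tags,
# so next_sibling really is the next <li>.
html = ('<ol class="x-pagination__ol">'
        '<li class="x-pagination__li selected"><a>1</a></li>'
        '<li class="x-pagination__li"><a>2</a></li>'
        '</ol>')

soup = BeautifulSoup(html, 'html.parser')
current = soup.find('li', 'x-pagination__li selected')
print(current.next_sibling is not None)  # True: a page 2 exists

One caveat: next_sibling also returns whitespace text nodes between tags, so on markup that contains them, current.find_next_sibling('li') is the more robust check.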

Also note that you do not need Selenium here unless you have to simulate user clicks or otherwise need real browser navigation.
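
For example, fetching one results page needs nothing more than requests and BeautifulSoup, because the listing data is present in the static HTML. The User-Agent header below is an illustrative precaution, since some sites serve different markup to non-browser clients:

import requests
from bs4 import BeautifulSoup

# Plain HTTP fetch of one results page; no browser required.
url = "https://www.ebay.com/sch/i.html?_nkw=laptop&LH_BIN=1&_pgn=1"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
print(response.status_code, len(soup.find_all("li", "s-item")))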
