Unable to get titles from a website while clicking the next page button

Posted 2024-04-26 00:50:59


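I've written a script in Python with Selenium to scrape the links of different posts from different pages while clicking the next page button, and to get the title of each post from its inner page. Although the content I'm dealing with here is static, I used Selenium to see how it parses the items while clicking through to the next page. I'm only after a solution related to Selenium.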

Website address

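If I define an empty list and extend it with all the links, I can eventually parse all the titles by reusing those links once the clicking on the next page button is done, but that is not what I want.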

However, what I intend to do is collect all the links from each of the pages and parse the title of each post from its inner page while clicking on the next page button. In short, I wish to do the two things simultaneously.

I've tried:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)

        try:
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();",elem)
            elem.click()
            time.sleep(2)
        except Exception:
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)

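When I run the above script, it parses the titles of the different posts by reusing the links from the first page, but then it throws this error: raise TimeoutException(message, screen, stacktrace). The error is raised when it hits the line elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']"))).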

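How can I collect the links from the first page, scrape the title of each post from its inner page, and then click the next page button to repeat the process until it is done?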


1 answer
Anonymous user
Answer 1 · posted 2024-04-26 00:50:59

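The next page button cannot be found because, once the loop has gone through every inner link, the driver is still sitting on the last inner page, and that page has no next page button.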
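The failure can be reproduced without a browser. The sketch below is a toy model, not Selenium: `PAGES`, `FakeDriver`, `find_next` and `scrape` are all made-up names, and the URLs are placeholders. It only illustrates that looking for the pager while the driver is on an inner page fails, whereas returning to the listing page first lets the loop continue.

```python
# Toy model of the site: each URL maps to what is visible on that page.
# All URLs and contents are invented for illustration.
PAGES = {
    "list/1": {"posts": ["post/a", "post/b"], "next": "list/2"},
    "list/2": {"posts": ["post/c"], "next": None},
    "post/a": {"title": "Title A"},
    "post/b": {"title": "Title B"},
    "post/c": {"title": "Title C"},
}


class FakeDriver:
    """Minimal stand-in for a WebDriver: it only remembers the current URL."""

    def __init__(self):
        self.current_url = None

    def get(self, url):
        self.current_url = url

    def find_next(self):
        """Return the next listing URL, or raise if this page has no pager."""
        page = PAGES[self.current_url]
        if "next" not in page:
            raise LookupError("no next button on " + self.current_url)
        return page["next"]


def scrape(driver, start, return_to_listing):
    """Collect titles page by page; optionally go back to the listing first."""
    titles = []
    listing = start
    while listing is not None:
        driver.get(listing)
        for post in PAGES[listing]["posts"]:
            driver.get(post)  # the driver now sits on an inner page
            titles.append(PAGES[post]["title"])
        if return_to_listing:
            driver.get(listing)  # reload the listing before using the pager
        try:
            listing = driver.find_next()
        except LookupError:
            break  # plays the role of the TimeoutException in the question
    return titles


print(scrape(FakeDriver(), "list/1", return_to_listing=False))
print(scrape(FakeDriver(), "list/1", return_to_listing=True))
```

Without the return step the run stops after the first listing page, which matches the behaviour described in the question.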

Instead, build the URL of each next page as shown below and load it directly.

urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(pageno)  # pageno starts at 2
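As a small sketch of the same idea, the page URLs can be generated by a helper. The name `listing_url` and the `pagesize` default are assumptions for illustration; the query parameters are the ones from the URL pattern above.

```python
from urllib.parse import urlencode

BASE = "https://stackoverflow.com/questions/tagged/web-scraping"


def listing_url(pageno, pagesize=30):
    """Build the URL of one listing page from its page number."""
    # Dict order is preserved, so the query reads tab, page, pagesize.
    query = urlencode({"tab": "newest", "page": pageno, "pagesize": pagesize})
    return f"{BASE}?{query}"


# Page 1 is the plain tag URL, so numbering starts at 2.
for pageno in range(2, 5):
    print(listing_url(pageno))
```

Each generated URL can then be passed to driver.get() in place of clicking the next page button.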

Try the code below.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

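def get_links(url):
    # URL template for listing pages 2, 3, ...; loading them directly
    # avoids having to find the next page button from an inner page.
    urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'
    npage = 2
    driver.get(url)
    while True:
        # Collect the hrefs as plain strings before leaving the listing page.
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)
        # The driver is now on an inner page, so open the next listing page.
        driver.get(urlnext.format(npage))
        try:
            # A visible next link means more listing pages follow this one.
            wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            npage += 1
            time.sleep(2)
        except Exception:
            # No next link found: stop. Note that the page just loaded is not scraped.
            break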

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)

    for item in get_links(link):
        print(item)
