Web scraping: infinite scroll and extracting links

from PageScroller import WebPageScroller
import bs4 as bs

sourceUrl = 'https://www.pakwheels.com/forums/c/travel-n-tours'

# Scroll to the bottom of the page and get the source code
scrollObject = WebPageScroller
pageSource = scrollObject.getScrolledPageSource(scrollObject, sourceUrl)

# Get the links
soup = bs.BeautifulSoup(pageSource, 'lxml')
blogUrls = []
for url in soup.find_all('a'):
    href = url.get('href')
    if href is None:
        # Anchors without an href would otherwise raise AttributeError on .find()
        continue
    if ('/forums/t/' in href
            and 'about-the-travel-n-tours-category' not in href
            and '/forums/t/topic/' not in href):
        blogUrls.append(href)
        print(href)
print(len(blogUrls))

1 Answer

User

#1 · Posted 2024-05-16 19:45:19

Scraping with Selenium is usually a poor choice, and the chances of finding a reliable way to handle infinite scroll with it are slim.

This site exposes a JSON endpoint at https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?_=<uts>, where <uts> is a Unix timestamp.
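As a rough illustration of how such a URL could be assembled from a given point in time, here is a minimal sketch. The names `BASE_URL`, `to_unix_millis` and `build_latest_url` are hypothetical, and the use of milliseconds is an assumption based on the 13-digit value 1491493915518 quoted in this answer.

```python
from datetime import datetime, timedelta, timezone

# Endpoint taken from the answer; the helper names below are hypothetical.
BASE_URL = "https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to a Unix timestamp in milliseconds.

    Integer timedelta division avoids floating-point rounding.
    """
    return (moment - EPOCH) // timedelta(milliseconds=1)


def build_latest_url(moment: datetime) -> str:
    """Build the latest.json URL for the given moment (assumes a millisecond timestamp)."""
    return "{}?_={}".format(BASE_URL, to_unix_millis(moment))


example = datetime(2017, 4, 6, 15, 51, 55, 518000, tzinfo=timezone.utc)
print(build_latest_url(example))
```

The example datetime corresponds to the timestamp 1491493915518 that appears in this answer, so the printed URL ends in `latest.json?_=1491493915518`.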

Here is how to find it. Open Chrome DevTools or Firebug and load the forum page. Look at the Network tab. There is an XHR request named latest.json?_=1491493915518. Click it.

The Request URL is shown as https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?_=1491493915518. That is your endpoint.

Now all you need is a Unix timestamp and a few lines of code:

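import time

import requests

# Current Unix timestamp; milliseconds are assumed because the example value above has 13 digits
current_uts = int(time.time() * 1000)
response = requests.get('https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?_={}'.format(current_uts))
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
print(response.json())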

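This returns a JSON representation of everything on that page. If you rerun the same script with a newer timestamp, it will retrieve new forum threads. If you want to retrieve older threads (or even scrape the entire forum), you can also go back in time by using older Unix timestamps. I will leave it to you to figure out how to build this into something more robust.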
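To connect this back to the original goal of collecting thread links, here is a minimal sketch of pulling thread URLs out of the parsed JSON. The answer does not show the response body, so the payload shape used here (a `topic_list` object holding a `topics` list whose items have `id` and `slug`) is an assumption, as is the `/forums/t/<slug>/<id>` URL pattern; check both against a real response before relying on them. The function name `topic_urls` and the sample data are made up for illustration.

```python
from typing import Any, Dict, List

FORUM_ROOT = "https://www.pakwheels.com/forums"


def topic_urls(payload: Dict[str, Any]) -> List[str]:
    """Build thread URLs from a parsed latest.json payload.

    Assumes payload["topic_list"]["topics"] is a list of dicts with
    "id" and "slug" keys; entries missing either key are skipped.
    """
    topics = payload.get("topic_list", {}).get("topics", [])
    urls = []
    for topic in topics:
        slug = topic.get("slug")
        topic_id = topic.get("id")
        if slug is None or topic_id is None:
            continue
        urls.append("{}/t/{}/{}".format(FORUM_ROOT, slug, topic_id))
    return urls


# Made-up sample data in the assumed shape, not a real response.
sample = {
    "topic_list": {
        "topics": [
            {"id": 101, "slug": "trip-to-the-north"},
            {"id": 102, "slug": "coastal-road-trip"},
            {"title": "entry without id or slug"},
        ]
    }
}
print(topic_urls(sample))
```

With a real response, `topic_urls(response.json())` would replace the HTML parsing and `href` filtering from the question, provided the assumed keys are present.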
