Paginated scrape returns the first page on every iteration

Posted 2024-03-28 13:10:44


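I am trying to scrape a paginated web page, but my code gets the first page on every iteration. When I click through the pages in a browser, the content is different.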

url = "http://www.x.y/z/a-b#/page-%s"

for i in range(1, 10):
  url2 = url % str(i)
  soup = urlToSoup(url2)
  print(url2)
  # url2 changes in every iteration
  # Here it will print the same product list in every iteration

This is the output:

http://www.x.y/z/a-b#/page-1
http://www.x.y/z/a-b#/page-2
http://www.x.y/z/a-b#/page-3
http://www.x.y/z/a-b#/page-4
http://www.x.y/z/a-b#/page-5
http://www.x.y/z/a-b#/page-6
http://www.x.y/z/a-b#/page-7
http://www.x.y/z/a-b#/page-8
http://www.x.y/z/a-b#/page-9

The pager item for page 2 (the items for pages 3, 4, ... are similar) looks like this:

<a rel="nofollow" href="http://www.x.y/z/a-b#/page-2"> <span>2</span> </a>

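Why does opening the URL in a browser (by clicking the link or through the address bar) produce a different page than fetching the same URL from code?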


1 Answer

Answer 1 · posted 2024-03-28 13:10:44

You are adding the page number to the "fragment identifier" (the part of the URL after the #). See https://www.w3.org/DesignIssues/Fragment.html:

The fragment identifier is a string after URI, after the hash, which identifies something specific as a function of the document. For a user interface Web document such as HTML page, it typically identifies a part or view. For example in the object

From RFC 3986:

the fragment identifier is separated from the rest of the URI prior to a dereference, and thus the identifying information within the fragment itself is dereferenced solely by the user agent, regardless of the URI scheme. Although this separate handling is often perceived to be a loss of information, particularly for accurate redirection of references as resources move over time, it also serves to prevent information providers from denying reference authors the right to refer to information within a resource selectively. Indirect referencing also provides additional flexibility and extensibility to systems that use URIs, as new media types are easier to define and deploy than new schemes of identification.

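So you are adding the index to a part of the URL that is never sent to the server. It is for client-side use only, "dereferenced solely by the user agent". The server sees the same URL on every iteration.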
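To see this without touching the network, here is a minimal sketch that strips the fragment from each URL the loop in the question generates, using the standard library's `urllib.parse.urldefrag`. The URL template is the one from the question; the variable name `requested` is only for illustration.

```python
from urllib.parse import urldefrag

url = "http://www.x.y/z/a-b#/page-%s"

# Collect what is left of each URL once the fragment is removed,
# which is the part an HTTP client actually sends to the server.
requested = set()
for i in range(1, 10):
    base, fragment = urldefrag(url % i)
    requested.add(base)
    print(fragment, "->", base)

# Number of distinct resources requested across the nine iterations.
print(len(requested))
```

Every iteration reduces to `http://www.x.y/z/a-b`, so the set ends up with a single entry, which matches the identical product list you are seeing.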

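Most likely the page is rendered by JavaScript that reads the fragment identifier and then either makes another request to fetch the data or decides which part of already-loaded data to display.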

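I suggest using Live HTTP Headers or a similar tool to inspect all the requests the page makes and see whether there is a second request you can call directly. Otherwise, use a JavaScript rendering technique such as Selenium, dryscrape or PyQt5; for details, see my answer to "Scraping Google Finance (BeautifulSoup)".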
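If the inspection does turn up a second request, the page number will usually travel in the path or query string, which the server does receive. The sketch below only builds such URLs; the endpoint `http://www.x.y/z/a-b/products` and the parameter name `page` are hypothetical placeholders, not something taken from the site, so substitute whatever your network inspection actually shows.

```python
from urllib.parse import urlencode, urlsplit

# HYPOTHETICAL endpoint and parameter name: replace both with whatever
# the browser's network tab shows the page requesting.
DATA_ENDPOINT = "http://www.x.y/z/a-b/products"


def data_url(page):
    """Build a URL whose page number travels in the query string."""
    return DATA_ENDPOINT + "?" + urlencode({"page": page})


urls = [data_url(i) for i in range(1, 10)]
for u in urls:
    parts = urlsplit(u)
    # The query is part of the HTTP request; the fragment (empty here) is not.
    print(parts.path, parts.query, repr(parts.fragment))
```

Unlike the fragment version, these nine URLs are nine different requests as far as the server is concerned, so each one can be passed to your existing `urlToSoup` helper (or parsed as JSON, if that is what the endpoint returns).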
