Unable to retrieve data with HTML scraping

import requests
from lxml import html

page = requests.get('http://www.mysupermarket.co.uk/tesco-price-comparison/Fruit/Tesco_Gala_Apple_Approx_160g.html')
tree = html.fromstring(page.content)
price_tesco = tree.xpath('//*[@id="PriceWrp"]/div[2]/span')
print(price_tesco)

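3 Answers

User

#1 · Edited 2024-06-09 08:08:29

The site is probably dynamic, so a plain request does not give you the complete HTML. In that case you can use the selenium library. It is somewhat slower, but it generally solves this kind of problem.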
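The selenium approach suggested above could look roughly like the sketch below. It assumes Selenium 4 with a local Chrome install, and that the page still exposes the `PriceWrp` element from the question. The function name `fetch_text_with_selenium`, the 10-second timeout, and the `--headless=new` flag are illustrative choices, not part of the original answer.

```python
# Sketch only: assumes Selenium 4 and a local Chrome installation.
URL = ("http://www.mysupermarket.co.uk/tesco-price-comparison/"
       "Fruit/Tesco_Gala_Apple_Approx_160g.html")
PRICE_XPATH = '//*[@id="PriceWrp"]/div[2]/span'


def fetch_text_with_selenium(url: str, xpath: str, timeout: float = 10.0) -> str:
    """Load `url` in headless Chrome and return the text of the first XPath match."""
    # Imported here so this file still loads on machines without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the page's scripts have inserted the element,
        # for at most `timeout` seconds (raises TimeoutException otherwise).
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath))
        )
        return element.text
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `print(fetch_text_with_selenium(URL, PRICE_XPATH))` would then print the element's text, provided the page is reachable. The explicit wait matters here: reading the element immediately after `driver.get` can still fail if the script has not run yet.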

User

#2 · Edited 2024-06-09 08:08:29

Because this is a JavaScript-rendered page, use requests_html and render the page first, like this:

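from requests_html import HTMLSession

session = HTMLSession()

r = session.get('http://www.mysupermarket.co.uk/tesco-price-comparison/Fruit/Tesco_Gala_Apple_Approx_160g.html')
# Run the page's JavaScript in headless Chromium (downloaded on first use)
r.html.render()
# xpath() returns a list of matching elements; take the first one
price = r.html.xpath('//*[@id="PriceWrp"]/div[2]/span')[0]
print(price.text)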

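User

#3 · Edited 2024-06-09 08:08:29

I can't view the site in question (I'm behind a firewall), but you should know that many sites today build their content dynamically with JavaScript, and basic libraries cannot fetch that content correctly. If your XPath really is correct but returns nothing, I assume that is what is happening here.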
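One way to check for the symptom described above (a correct XPath that matches nothing) is to run the same XPath against the raw HTML and look for an empty list. The sketch below works offline on two made-up HTML strings, since the real page's markup is not shown in the thread: `RAW_HTML` stands in for what a plain request might return, `RENDERED_HTML` for the page after its scripts have run, and the price value is invented for illustration.

```python
from lxml import html

PRICE_XPATH = '//*[@id="PriceWrp"]/div[2]/span'

# Hypothetical response to a plain HTTP request: the wrapper exists,
# but the price would only be injected later by JavaScript in a browser.
RAW_HTML = """
<html><body>
  <div id="PriceWrp"></div>
  <script>/* price is injected here at load time */</script>
</body></html>
"""

# Hypothetical markup after rendering (the value 0.45 is made up).
RENDERED_HTML = """
<html><body>
  <div id="PriceWrp">
    <div>Price</div>
    <div><span>0.45</span></div>
  </div>
</body></html>
"""


def find_price_nodes(markup: str) -> list:
    """Return the elements matched by the question's XPath."""
    tree = html.fromstring(markup)
    return tree.xpath(PRICE_XPATH)


for label, markup in (("raw", RAW_HTML), ("rendered", RENDERED_HTML)):
    nodes = find_price_nodes(markup)
    if not nodes:
        # An empty list means the element is absent from this HTML.
        print(f"{label}: XPath matched nothing")
    else:
        print(f"{label}: {nodes[0].text_content().strip()}")
```

Note that the question's code prints the list returned by `xpath()` itself, so even on a match it would show element objects rather than the price; reading `.text` or `.text_content()` on a matched element gives the visible value.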

Your best option is a library that can render and scrape this kind of dynamic content, such as selenium or Requests-HTML (my preference, because it is headless).
