tree.xpath returns an empty list

Published 2024-04-27 00:24:26


I'm trying to write a program that can scrape a given website. So far I have:

from lxml import html
import requests

page = requests.get('https://www.cruiseplum.com/search#{"numPax":2,"geo":"US","portsMatchAll":true,"numOptionsShown":20,"ppdIncludesTaxTips":true,"uiVersion":"split","sortTableByField":"dd","sortTableOrderDesc":false,"filter":null}')

tree = html.fromstring(page.content)

date = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[1]/text()')

ship = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[2]/text()')

length = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[4]/text()')

meta = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[6]/text()')

price = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[7]/text()')

print('Date: ', date)
print('Ship: ', ship)
print('Length: ', length)
print('Meta: ', meta)
print('Price: ', price)

When I run this, the lists all come back empty.

I'm very new to Python and to coding in general, so any help you can offer would be greatly appreciated.

Thanks


2 Answers

The problem seems to be the URL you are navigating to. Opening that URL in a browser triggers a prompt asking whether you want to restore a bookmarked search.

I don't see an easy workaround: clicking "Yes" triggers a JavaScript action rather than an actual redirect to a URL with different parameters.

I'd suggest using something like Selenium for this.

First, the link you are using is incorrect; this is the correct link (the one the site returns to after you click the "Yes" button and it downloads the data):

https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}
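As a side note, the %22 sequences in that link are just percent-encoded double quotes from the JSON fragment in the original question. If you ever need to rebuild such a link, the standard library can do the encoding; this is a sketch where the `safe` set passed to `quote` is chosen to reproduce the link above (braces, colons and commas stay literal, only the quotes get encoded):

```python
from urllib.parse import quote

# The raw JSON fragment the site embeds after '#'
fragment = ('{"numPax":2,"geo":"US","portsMatchAll":true,'
            '"numOptionsShown":20,"ppdIncludesTaxTips":true,'
            '"uiVersion":"split","sortTableByField":"dd",'
            '"sortTableOrderDesc":false,"filter":null}')

# Encode only the double quotes; everything listed in `safe` is kept as-is,
# which matches the URL the site itself produces
encoded = quote(fragment, safe='{}:,')
url = 'https://www.cruiseplum.com/search#' + encoded
print(url)
```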

Second, when you use requests to fetch the response object, the content data hidden in the table is not returned:

from lxml import html
import requests

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
r = requests.get(u)
t = html.fromstring(r.content)

for i in t.xpath('//tr//text()'):
    print(i)

This returns:

Recent update: new computer-optimized interface and new filters
Want to track your favorite cruises?
Login or sign up to get started.
Login / Sign Up
Loading...
Email status
Unverified
My favorites & alerts
Log out
Want to track your favorite cruises?
Login or sign up to get started.
Login / Sign Up
Loading...
Email status
Unverified
My favorites & alerts
Log out
Date Colors:
(vs. selected)
Lowest Price
Lower Price
Same Price
Higher Price

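Part of the reason requests only sees this static shell is that everything after the `#` is a URL fragment, which is never sent in the HTTP request at all; the site's JavaScript reads it on the client side. You can see the split with the standard library:

```python
from urllib.parse import urlsplit

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22}'
parts = urlsplit(u)

# Only scheme, host and path go over the wire; the search settings live
# entirely in the fragment, so the server never sees them
print(parts.path)      # /search
print(parts.fragment)  # {%22numPax%22:2,%22geo%22:%22US%22}
```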
Even with requests-html the content is still hidden, because the session below never executes any JavaScript (requests-html can do that via r.html.render(), but that also relies on a headless Chromium under the hood):

from requests_html import HTMLSession
session = HTMLSession()
r = session.get(u)

You need to use selenium to access the hidden html content:

from lxml import html
from selenium import webdriver
import time

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.get(u)

time.sleep(2)

driver.find_element_by_id('restoreSettingsYesEncl').click()
time.sleep(10)  # wait until the website downloads the data; without this we can't move on

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")

t = html.fromstring(source_code)

for i in t.xpath('//td[@class="dc-table-column _1"]/text()'):
    print(i.strip())

driver.quit()

This returns the first column (the ship names):

Costa Luminosa
Navigator Of The Seas
Navigator Of The Seas
Carnival Ecstasy
Carnival Ecstasy
Carnival Ecstasy
Carnival Victory
Carnival Victory
Carnival Victory
Costa Favolosa
Costa Favolosa
Costa Favolosa
Costa Smeralda
Carnival Inspiration
Carnival Inspiration
Carnival Inspiration
Costa Smeralda
Costa Smeralda
Disney Dream
Disney Dream

As you can see, the content in the table can now be accessed through selenium's get_attribute("innerHTML").

The next step is to scrape the rest of each row (ship, itinerary, dates, region, ...) and store it in a csv file (or any other format), then repeat for all 4051 pages.
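For the storage step, the standard library's csv module is enough. A minimal sketch, with made-up rows standing in for the scraped values (the header names and file name are my assumptions, not the site's):

```python
import csv

# Hypothetical rows in the same shape as the scraped columns:
# date, ship, length, meta, price
rows = [
    ('2020-03-01', 'Costa Luminosa', '5', '7.2', '$249'),
    ('2020-03-02', 'Navigator Of The Seas', '4', '6.8', '$199'),
]

with open('cruises.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'ship', 'length', 'meta', 'price'])  # assumed header
    writer.writerows(rows)
```

For all 4051 pages, either collect every page's rows into one list before writing, or write the header once and reopen the file in append mode ('a') per page, so the file isn't rewritten each time.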
