Why doesn't lxml find the XPath that Chrome provides?

Posted 2024-04-19 06:34:50


I'm trying to get a company's industry information from LinkedIn's search page. I copied the XPath from Chrome DevTools, but the query returns an empty list. What's the problem here?

from lxml import html
import requests

page = requests.get('https://www.linkedin.com/search/results/companies/?keywords=cisco.com')
tree = html.fromstring(page.content)

# XPath copied from Chrome DevTools
industry = tree.xpath('//*[@id="ember3734"]/div/div[1]/p[1]')

print(industry)  # prints an empty list: []
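A quick way to see why the query comes back empty is to run the same XPath against a minimal static document. If the element isn't present in the HTML that requests downloaded, lxml simply returns an empty list (the template string below is illustrative):

```python
from lxml import html

# Simulated server response: what requests actually downloads is a
# bare template, because the real content is built by JavaScript.
template = '<html><body><div id="app">Loading...</div></body></html>'
tree = html.fromstring(template)

# The DevTools XPath targets an element that only exists after the
# page's JavaScript has run, so against the raw HTML it matches nothing.
print(tree.xpath('//*[@id="ember3734"]/div/div[1]/p[1]'))  # → []
```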

2 Answers

I wrote the script with Selenium and PhantomJS because the site relies heavily on JavaScript.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import lxml.html
import re
from selenium import webdriver
from time import sleep
from selenium.webdriver import DesiredCapabilities
from pprint import pprint

# Spoof a desktop Chrome user agent so LinkedIn serves the full page.
desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
desired_capabilities['phantomjs.page.customHeaders.User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) ' \
                                                                  'AppleWebKit/537.36 (KHTML, like Gecko) ' \
                                                                  'Chrome/39.0.2171.95 Safari/537.36'
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)

username = 'email@email.com'  # replace with your LinkedIn credentials
password = 'password'

driver.set_window_size(1120, 550)

# Log in, then load the search page and give its JavaScript time to run.
driver.get("https://www.linkedin.com")
driver.find_element_by_id('login-email').send_keys(username)
driver.find_element_by_id('login-password').send_keys(password)
driver.find_element_by_id("login-submit").click()
driver.get("https://www.linkedin.com/search/results/companies/?keywords=cisco.com")
sleep(3)

html = driver.page_source
root = lxml.html.fromstring(html)

# Pull each company name out of the rendered search-result markup.
reg = re.compile(r'ember-view">\s+<h3\s+class="search-result__title\s+'
                 r'Sans-17px-black-85%-semibold-dense">(.*?)</h3>')
names = reg.findall(html)

pprint(names)

driver.quit()
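The regex extraction step can be checked on its own against a small static sample; the HTML fragment below is illustrative, mimicking the rendered search-result markup the script expects:

```python
import re

# Illustrative fragment in the shape of LinkedIn's rendered markup.
html_sample = (
    'ember-view">  <h3 class="search-result__title '
    'Sans-17px-black-85%-semibold-dense">Cisco</h3>'
)

# Same pattern as in the script above, as a raw string.
reg = re.compile(r'ember-view">\s+<h3\s+class="search-result__title\s+'
                 r'Sans-17px-black-85%-semibold-dense">(.*?)</h3>')

print(reg.findall(html_sample))  # → ['Cisco']
```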


I think this page is generated by JavaScript. Since requests downloads the page without executing any JavaScript, you only get the main page template, not the data you expect.

To confirm this, use "View Page Source" in Chrome: the element you're targeting won't appear in the raw source.
