Python：从XPath获取元素值

def xpath_soup(element): components = [] child = element if element.name else element.parent for parent in child.parents: previous = itertools.islice(parent.children, 0,parent.contents.index(child)) xpath_tag = child.name xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1 components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index)) child = parent components.reverse() return '/%s' % '/'.join(components) page = requests.get("https://www.gaumard.com/obstetricmr") html = str(BeautifulSoup(page.content, 'html.parser')) soup = BeautifulSoup(html, 'lxml') elem = soup.find(string=re.compile('xt-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice')) xPathValue = xpath_soup(elem) print(xPathValue)

2条回答

网友

1楼 · 编辑于 2024-06-09 14:13:58

下面是如何使用XPath获取全文

import requests
from lxml import html

page = requests.get("https://www.gaumard.com/obstetricmr").text
text = html.fromstring(page).xpath('//*[@style="margin: 0 auto;"][2]/div/text()')
print(text[0].strip())

输出：

Obstetric MR™ is a next-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice faster than ever before. Using the latest technology in holographic visualization, Obstetric MR brings digital learning content into the physical simulation exercise, allowing participants to link knowledge and skill through an entirely new hands-on training experience. The future of labor and delivery simulation is here.

网友

2楼 · 编辑于 2024-06-09 14:13:58

特定的XPath不会有多大帮助，因为如前所述，web页面可能会有所不同。搜索文本节点并获取包含该字符串的数组或节点列表的通用XPath可能有助于某些后期处理

在Firefox控制台上尝试：

nodes = $x('//*[contains(text(),"next-generation mixed reality")]', window.document, "nodes");
<- Array [ div ]

nodes[0].textContent;
<- "Obstetric MR™ is a next-generation...(redacted)"

此XPath可以用于其他页面
'//*[contains(text(),"next-generation mixed reality")]'
前提是它们包含next-generation mixed reality字符串

使用python时也一样：

import requests
from lxml import html
url = 'https://www.gaumard.com/obstetricmr'
response = requests.get(url)
html_doc = response.content
xpath0 = '//*[contains(text(),"next-generation mixed reality")]'
result_arr = html.fromstring(html_doc).xpath(xpath0)
result_arr[0].text

输出：

'Obstetric MR™ is a next-generation mixed...'

相关问题更多 >

编程相关推荐

热门问题

热门文章