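I am very new to XPath and web scraping, so I apologize if this is a fairly trivial question. To keep the data in our database up to date, I am trying to scrape multiple websites. I can get the XPath for a partial string, but I am not sure how to use that XPath to get the full value.
Code: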
import itertools
import re

import requests
from bs4 import BeautifulSoup


def xpath_soup(element):
    """Build an absolute XPath for a BeautifulSoup element or text node."""
    components = []
    # A text node has no tag name, so start from its parent element.
    child = element if element.name else element.parent
    for parent in child.parents:
        # Siblings that precede `child` inside `parent`.
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        # XPath positions are 1-based and count only same-named siblings.
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)


page = requests.get("https://www.gaumard.com/obstetricmr")
html = str(BeautifulSoup(page.content, 'html.parser'))
soup = BeautifulSoup(html, 'lxml')
# Locate the text node that contains the partial string.
elem = soup.find(string=re.compile('xt-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice'))
xPathValue = xpath_soup(elem)
print(xPathValue)
I am trying to use xPathValue to get the full value of the element.
The expected result is the full version of
xt-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice
as it appears in:
Obstetric MR™ is a next-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice faster than ever before. Using the latest technology in holographic visualization, Obstetric MR brings digital learning content into the physical simulation exercise, allowing participants to link knowledge and skill through an entirely new hands-on training experience. The future of labor and delivery simulation is here.
This full value should be obtained by using xPathValue.
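Here is how to use XPath to get the full-text output:
A page-specific XPath will not help much because, as mentioned, the web pages may differ. A generic XPath that searches the text nodes and returns an array or node list of those containing the string is more useful, and the result can then be post-processed.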
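To illustrate the idea of a generic text-node search, here is a minimal sketch. It assumes lxml is installed (the question already uses it as a BeautifulSoup parser); `SAMPLE` and `find_text_nodes` are made-up names for illustration, and the sample markup only stands in for a real page.

```python
from lxml import html

# Hypothetical sample markup standing in for a fetched page.
SAMPLE = """
<html><body>
  <div><p>Obstetric MR is a next-generation mixed reality training solution.</p></div>
  <div><span>Unrelated text.</span></div>
  <ul><li>Another next-generation mixed reality mention.</li></ul>
</body></html>
"""


def find_text_nodes(markup, needle):
    """Return (xpath, full_text) for each text node that contains `needle`."""
    tree = html.fromstring(markup)
    root = tree.getroottree()
    results = []
    # text() selects text nodes; contains(., $needle) tests each node's own string.
    for node in tree.xpath('//text()[contains(., $needle)]', needle=needle):
        owner = node.getparent()
        if node.is_tail:
            # Tail text follows `owner`, so the enclosing element is one level up.
            owner = owner.getparent()
        results.append((root.getpath(owner), owner.text_content().strip()))
    return results


matches = find_text_nodes(SAMPLE, "next-generation mixed reality")
for path, text in matches:
    print(path, "->", text)
```

Each result pairs the XPath of the enclosing element with its complete text, so the list can be filtered or stored afterwards without knowing the page layout in advance.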
Try it in the Firefox console:
This XPath can also be used on other pages
'//*[contains(text(),"next-generation mixed reality")]'
provided they contain the string
next-generation mixed reality
The same works when using Python:
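A minimal sketch of what that could look like with lxml follows. This is an assumption, not necessarily the code the answer originally used; `SAMPLE` and `full_texts` are hypothetical names, and the inline markup stands in for the downloaded page so the example runs without network access.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page content.
SAMPLE = (
    "<html><body><div><p>Obstetric MR is a next-generation mixed reality "
    "training solution designed to help learners bridge the gap between "
    "theory and practice faster than ever before.</p></div></body></html>"
)

# The generic XPath quoted above.
XPATH = '//*[contains(text(),"next-generation mixed reality")]'


def full_texts(markup, xpath=XPATH):
    """Return the complete text of every element matched by `xpath`."""
    tree = html.fromstring(markup)
    return [el.text_content().strip() for el in tree.xpath(xpath)]


texts = full_texts(SAMPLE)
print(texts)

# For the live page, something like this should work (not run here):
# import requests
# page = requests.get("https://www.gaumard.com/obstetricmr")
# print(full_texts(page.content))
```

One caveat: in XPath 1.0, `contains(text(), ...)` only inspects the first text-node child of each element. If the phrase may come after an inline tag, `contains(., ...)` checks the whole string value instead, at the cost of also matching every ancestor element.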
Output: