Python: get the HTML element/node/tag at an exact position
I have a long HTML document, and I know the exact position of some text inside it. For example:
<html>
<body>
<div>
<a>
<b>
I know the exact position of this text
</b>
<i>
Another text
</i>
</a>
</div>
</body>
</html>
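I know that the sentence "I know the exact position of this text" starts at character number 'x' and ends at character number 'y'. But I need to get the whole tag/node/element that contains this value, and possibly several of its parents as well.
What is a simple way to handle this?
// Edit
To make it clearer: the only thing I have is an integer value that describes where the sentence starts.
For example, 2048.
I cannot assume anything about the structure of the document. Starting from that point, I have to walk up the parent nodes one by one.
Even the sentence at the given position (2048) is not necessarily unique.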
2 Answers
0
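You can read the content of the whole HTML document as a string. Then you can insert a marker into that string at the known position (an HTML anchor element with a unique id) and parse the result with xml.etree.ElementTree as if the marker had been in the original document. Next, you can use XPath to find the parent element of the marker and remove the auxiliary marker. The result is the same as if the original document had been parsed, except that you now know which element contains the text!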
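Note: you need to know whether the position is a byte position or an abstract character position. (Think of multi-byte encodings, or encodings in which the length of the sequence for a character is not fixed. Also consider line endings, which may be one or two bytes.)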
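If the known position turns out to be a byte offset, it has to be converted to a character offset before slicing the decoded string. A minimal sketch, assuming UTF-8 input; the helper name `byte_offset_to_char_offset` and the sample markup are made up for illustration:

```python
def byte_offset_to_char_offset(raw: bytes, byte_pos: int,
                               encoding: str = "utf-8") -> int:
    """Return the index in the decoded text that corresponds to byte_pos."""
    # Decoding only the prefix counts the characters before the offset.
    # Strict decoding raises UnicodeDecodeError if byte_pos falls inside
    # a multi-byte sequence, which signals a wrong offset.
    return len(raw[:byte_pos].decode(encoding))


# Hypothetical sample: 'é' takes 2 bytes and '€' takes 3 bytes in UTF-8.
raw = "<p>café € text</p>".encode("utf-8")
byte_pos = raw.index(b"text")                         # 13
char_pos = byte_offset_to_char_offset(raw, byte_pos)  # 10
text = raw.decode("utf-8")
found = text[char_pos:char_pos + 4]                   # 'text'
```

Line endings deserve the same care: opening a file in text mode with the default `newline` setting turns `\r\n` into `\n`, so offsets measured against the raw file can shift by one per preceding line. Reading the file as bytes and decoding it yourself keeps the offsets aligned with the file on disk.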
Try the following example. It assumes that the sample from your question is stored in doc.html and uses Windows line endings:
#!python3
import xml.etree.ElementTree as ET
fname = 'doc.html'
pos = 64
with open(fname, encoding='utf-8') as f:
    content = f.read()
# The position_id will be used in XPath, the position_anchor
# uses the variable only for readability. The position anchor
# has the form of an HTML element to be found easily using
# the XPath expression.
position_id = 'my_unique_position_{}'.format(pos)
position_anchor = '<a id="{}" />'.format(position_id)
# The modified content has one extra anchor as the position marker.
modified_content = content[:pos] + position_anchor + content[pos:]
root = ET.fromstring(modified_content)
ET.dump(root)
print('----------------')
# Now some examples for getting the info around the point.
# '.' = from here; '//' = wherever; 'a[@id=...]' = anchor (a) element
# with the attribute id with the value.
# We will not use it later -- only for demonstration.
anchor_element = root.find('.//a[@id="{}"]'.format(position_id))
ET.dump(anchor_element)
print('----------------')
# The text at the original position -- the text became the tail
# of the element.
print(repr(anchor_element.tail))
print('================')
# Now, from scratch, get the nearest parent from the position.
parent = root.find('.//a[@id="{}"]/..'.format(position_id))
ET.dump(parent)
print('----------------')
# ... and the anchor element (again) as the nearest child
# with the attributes.
anchor = parent.find('./a[@id="{}"]'.format(position_id))
ET.dump(anchor)
print('----------------')
# If the marker split the text, part of the text belongs to
# the parent, part is the tail of the anchor marker.
print(repr(parent.text))
print(repr(anchor.tail))
print('----------------')
# Modify the parent to remove the anchor element (to get
# the original structure without the marker). Do not forget
# that the text became part of the marker element as its tail.
parent.remove(anchor)
parent.text = (parent.text or '') + (anchor.tail or '')
ET.dump(parent)
print('----------------')
# The structure of the whole document now does not contain
# the added anchor marker, and you get the reference
# to the nearest parent.
ET.dump(root)
print('----------------')
It prints the following:
c:\_Python\Dejwi\so25370255>a.py
<html>
<body>
<div>
<a>
<b>
I know<a id="my_unique_position_64" /> the exact position of this text
</b>
<i>
Another text
</i>
</a>
</div>
</body>
</html>
----------------
<a id="my_unique_position_64" /> the exact position of this text
----------------
' the exact position of this text\n '
================
<b>
I know<a id="my_unique_position_64" /> the exact position of this text
</b>
----------------
<a id="my_unique_position_64" /> the exact position of this text
----------------
'\n I know'
' the exact position of this text\n '
----------------
<b>
I know the exact position of this text
</b>
----------------
<html>
<body>
<div>
<a>
<b>
I know the exact position of this text
</b>
<i>
Another text
</i>
</a>
</div>
</body>
</html>
----------------
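The question also asks for several parents, and ElementTree elements do not keep a reference to their parent. One possible way around that is to build a child-to-parent map once and follow it upward from the marker. This is only a sketch: the function name `ancestors_at`, the default marker id, and the compact sample string are assumptions for illustration, and the input must be well-formed XML with the offset falling inside text rather than inside a tag, otherwise `ET.fromstring` raises `ParseError`.

```python
import xml.etree.ElementTree as ET


def ancestors_at(content, pos, marker_id="my_unique_position"):
    """Return the elements enclosing character offset pos, innermost first."""
    # Insert the same kind of anchor marker as in the example above.
    marker = '<a id="{}" />'.format(marker_id)
    root = ET.fromstring(content[:pos] + marker + content[pos:])

    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    node = parent_of.get(root.find('.//a[@id="{}"]'.format(marker_id)))
    chain = []
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)
    return chain


# Compact version of the sample document from the question.
sample = ("<html><body><div><a>"
          "<b>I know the exact position of this text</b>"
          "<i>Another text</i>"
          "</a></div></body></html>")
pos = sample.index("exact")
tags = [el.tag for el in ancestors_at(sample, pos)]
print(tags)  # ['b', 'a', 'div', 'body', 'html']
```

The returned elements belong to the tree that still contains the marker; it can be removed from the innermost element in the same way as shown in the example above.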
1
Assuming the <b> in this example is unique, you can handle this with XPath and xml.etree.ElementTree.
import xml.etree.ElementTree as ET
tree = ET.parse('xmlfile')
root = tree.getroot()
myEle = root.findall(".//*[b]")
myEle is now a list holding the parent element of 'b', which in this example is 'a'.
If you only want the b element itself, you can do this:
myEle = root.findall(".//b")
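If you want the elements below a, there are a couple of different ways:
# All descendants of a (use ".//a/*" for direct children only).
myEle = root.findall(".//a//*")
# Descendants of the element that contains a, minus the first match (a itself).
myEle = root.findall('.//*[a]//*')[1:]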
For more information about XPath, see here: XPath
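To check what these expressions select, they can be run against an in-memory copy of the sample document from the question. A small sketch, assuming the markup is well-formed XML; the compact `sample` string and the variable names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Compact version of the sample document from the question.
sample = ("<html><body><div><a>"
          "<b>I know the exact position of this text</b>"
          "<i>Another text</i>"
          "</a></div></body></html>")
root = ET.fromstring(sample)

# Elements that have a <b> child, i.e. the parent of <b>.
parents_of_b = [el.tag for el in root.findall(".//*[b]")]      # ['a']
# The <b> elements themselves.
b_elements = [el.tag for el in root.findall(".//b")]           # ['b']
# Direct children of <a> versus all of its descendants.
children_of_a = [el.tag for el in root.findall(".//a/*")]      # ['b', 'i']
descendants_of_a = [el.tag for el in root.findall(".//a//*")]  # ['b', 'i']
```

In this small document the children and the descendants of a are the same; they would differ as soon as b or i contained nested elements.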