Equivalent of InnerHTML when parsing HTML with lxml.html

from lxml import html
from cStringIO import StringIO

t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
# body.text holds any text that precedes the first child element
print (body.text or '') + ''.join([html.tostring(child) for child in body.iterdescendants()])

3 Answers

User

#1 · Edited 2024-05-23 15:38:17

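from lxml import html
import lxml.etree as ET

# t is the tree parsed in the question
body = t.xpath("//body")
for tag in body:
    # tag[0] and tag[1] are the first two children of <body>: the <h1> and the <p>
    h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
    p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
    htext = h[0].text_content()
    ptext = p[0].text_content()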

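You can also call .get('href') on a tag to read a single attribute, or use .attrib to access all of its attributes.

The tag indexes here are hard-coded, but you can also do this dynamically.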
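
As a rough sketch of the dynamic approach (written for Python 3; the sample markup and the names texts, link, href, and attrs are made up for illustration and are not from the thread), you could iterate over the children of body rather than indexing them, and read attributes with .get() and .attrib:

```python
from lxml import html

# Hypothetical sample markup, not taken from the thread.
doc = html.document_fromstring(
    '<body><h1 class="title">A title</h1>'
    '<p>See <a href="/docs">the docs</a></p></body>'
)

# Walk the direct children of <body> instead of indexing tag[0], tag[1].
texts = {child.tag: child.text_content() for child in doc.body}

# .get() reads one attribute and returns None if it is missing.
link = doc.body.xpath(".//a")[0]
href = link.get("href")

# .attrib behaves like a dict of all attributes on the element.
attrs = dict(doc.body[0].attrib)

print(texts, href, attrs)
```

Keying the dictionary by tag name only works here because each tag appears once; with repeated tags a list of (tag, text) pairs would be a safer choice.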

User

#2 · Edited 2024-05-23 15:38:17

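You can get the child nodes of an ElementTree node with the root node's getchildren() or iterdescendants() methods (getchildren() is deprecated in lxml; iterchildren() is the current equivalent):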

>>> from lxml import etree
>>> from cStringIO import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants():
...  print etree.tostring(child)
...
<h1>A title</h1>

<p>Some text</p>

This can be shortened as follows:

print ''.join([etree.tostring(child) for child in root.iterdescendants()])

User

#3 · Edited 2024-05-23 15:38:17

Sorry for bringing this up again, but I have been looking for a solution and yours contains a bug:

<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>

Text directly under the root element is ignored. I ended up doing this:

(body.text or '') +\
''.join([html.tostring(child) for child in body.iterchildren()])
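
The expression above could be wrapped in a small helper. This is only a sketch for Python 3: the function name inner_html is hypothetical, and encoding="unicode" is passed so that tostring() returns str rather than bytes.

```python
from lxml import html


def inner_html(element):
    """Return the serialized contents of element, without its own tag."""
    # Text before the first child is stored on the parent, not on a child.
    parts = [element.text or ""]
    # Direct children only: tostring() already includes each child's
    # descendants and its tail text.
    parts.extend(
        html.tostring(child, encoding="unicode")
        for child in element.iterchildren()
    )
    return "".join(parts)


doc = html.document_fromstring(
    "<body>This text is ignored\n<h1>Title</h1><p>Some text</p></body>"
)
result = inner_html(doc.body)
print(result)
```

Because tostring() includes tail text by default, text that follows a child element is kept without any extra handling; only the leading element.text needs to be added by hand.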
