XPath via Python

Posted 2024-05-29 04:19:14


I am working with the following HTML:

<html> <body> <div class="directions" itemprop="instructions"> <h6>Instructions</h6> <p>Sharpen your <a href="pencil.html" class="crosslink">pencil</a> (or, alternatively, use your pen)</p> <p>In a large paper sheet, write your name. When the ink thickens slightly, gently open the <a href="envelop.html" class="crosslink">envelop</a> and insert the <a href="letter.html" class="crosslink" >letter</a> inside folded into 3. Set aside.</p> <p>Use the pen again to <a href="write.html" class="crosslink">write</a> your name and address into the evelope. Include the destination <a href="address.html" class="crosslink">address</a>.</p> <p>Seal the envelop and stamp it</p> <p class="copyright">Instruction courtesy of John Doe</p> </div> </body> </html>

The result I expect is an ordered array of text items, one per paragraph, with the HTML tags ignored.

result=[
'Sharpen your pencil (or, alternatively, use your pen)',
'In a large paper sheet, write your name. When the ink thickens slightly, gently open the envelop and insert the letter inside folded into 3. Set aside',
'Use the pen again to write your name and address into the envelop. Include the destination address',
'Seal the envelop and stamp it'
]

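I am using Python to parse the HTML and extract the pieces of information I need. With tree.xpath('//*[@itemprop="instructions"]') I get the element I want, but I can't seem to get the text out of it in the form I want.

My latest attempt (still failing) is as follows: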

for a in tree.xpath('//*[@itemprop="instructions"]'):
    for i in a.xpath('./p'):
        temptext = ""
        for c in i.xpath('text()'):
            temptext += c
        for c in i.xpath('./a'):
            temptext += c.text
        tempIteration.append(temptext)

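Edit for clarity:

This gives an incorrect result: the text of the "a" nodes ends up in the wrong order. Notice that "pencil" lands at the end of item 1 instead of right after "Sharpen your". The same thing happens in the remaining items.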

result=[
'Sharpen your (or, alternatively, use your pen)pencil',
'In a large paper sheet, write your name. When the ink thickens slightly, gently open the and insert the inside folded into 3. Set asideenvelopletter',
'Use the pen again to your name and address into the envelop. Include the destination writeaddress',
'Seal the envelop and stamp it',
'Instruction Courtesy of John Doe'
]

I haven't been able to get this working, so any help would be greatly appreciated.


2 Answers

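You can use the getchildren() method together with the text and tail attributes of each element: text is the text before an element's first child, and tail is the text that follows an element's closing tag. I have never used lxml before, but I worked this out from the documentation, and it works in the example below.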

from lxml import etree

html='''<html>
<body>
<div class="directions" itemprop="instructions">
<h6>Instructions</h6>
<p>Sharpen your <a href="pencil.html" class="crosslink">pencil</a> (or, alternatively, use your pen)</p>
<p>In a large paper sheet, write your name. When the ink thickens slightly, gently open the <a href="envelop.html" class="crosslink">envelop</a> and insert the <a href="letter.html" class="crosslink" >letter</a> inside folded into 3. Set aside.</p>
<p>Use the pen again to <a href="write.html" class="crosslink">write</a> your name and address into the evelope. Include the destination <a href="address.html" class="crosslink">address</a>.</p>
  <p>Seal the envelop and stamp it</p>
<p class="copyright">Instruction courtesy of John Doe</p>
</div>
</body>
  </html>'''

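# Parse the HTML string into an element tree.
tree = etree.HTML(html)
result = []
for a in tree.xpath('//*[@itemprop="instructions"]'):
    for i in a.xpath('./p'):
        # Text before the first child element (None if there is none).
        temptext = i.text or ""
        for j in i.getchildren():
            # Text inside the child, then the text that follows it.
            temptext += j.text or ""
            temptext += j.tail or ""
        result.append(temptext)

print(result)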

This produces:

[
'Sharpen your pencil (or, alternatively, use your pen)', 
'In a large paper sheet, write your name. When the ink thickens slightly, gently open the envelop and insert the letter inside folded into 3. Set aside.', 
'Use the pen again to write your name and address into the evelope. Include the destination address.', 
'Seal the envelop and stamp it', 
'Instruction courtesy of John Doe'
]

You can then use result[:-1] to drop the last item.
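As an alternative sketch: an element's itertext() method yields its own text and the text of its descendants in document order, so joining the pieces gives the paragraph text without handling text and tail by hand. The example below uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made to keep the snippet dependency-free; it only works here because the markup is well-formed XML). lxml elements provide the same itertext() method. The helper name paragraph_texts and the shortened markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened version of the markup from the question (assumed well-formed).
html = """<html><body>
<div class="directions" itemprop="instructions">
<h6>Instructions</h6>
<p>Sharpen your <a href="pencil.html" class="crosslink">pencil</a> (or, alternatively, use your pen)</p>
<p>Seal the envelop and stamp it</p>
<p class="copyright">Instruction courtesy of John Doe</p>
</div>
</body></html>"""


def paragraph_texts(markup):
    """Return the text of each non-copyright <p> inside the instructions block."""
    root = ET.fromstring(markup)
    texts = []
    for block in root.findall(".//*[@itemprop='instructions']"):
        for p in block.findall("p"):
            if p.get("class") == "copyright":
                continue  # skip the credit line instead of slicing it off later
            # itertext() yields text and tail strings in document order.
            texts.append("".join(p.itertext()))
    return texts


result = paragraph_texts(html)
print(result)
```

Filtering on the class attribute avoids relying on the copyright paragraph always being the last item, which result[:-1] assumes.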

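Not sure if this helps, since my knowledge of XPath is limited, but could this be because you did not close the <div class="directions" itemprop="instructions"> element?

Shouldn't you have this instead: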

<html>
<body>
    <div class="directions" itemprop="instructions">
        <h6>Instructions</h6>
        <p>Sharpen your <a href="pencil.html" class="crosslink">pencil</a> (or, alternatively, use your pen)</p>
        <p>In a large paper sheet, write your name. When the ink thickens slightly, gently open the <a href="envelop.html" class="crosslink">envelop</a> and insert the <a href="letter.html" class="crosslink" >letter</a> inside folded into 3. Set aside.</p>
        <p>Use the pen again to <a href="write.html" class="crosslink">write</a> your name and address into the evelope. Include the destination <a href="address.html" class="crosslink">address</a>.</p>
        <p>Seal the envelop and stamp it</p>
    </div>
    <p class="copyright">Instruction courtesy of John Doe</p>
</body>
</html>

Note that I added the </div>.

Hope this helps :)
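A small check of that suggestion, sketched with the standard library's xml.etree.ElementTree (swapped in for lxml as an assumption; the markup is trimmed and the function name instruction_paragraphs is hypothetical): once the copyright paragraph sits outside the div, selecting the div's direct p children no longer returns it.

```python
import xml.etree.ElementTree as ET

# Trimmed markup with the copyright paragraph moved outside the div.
html = """<html><body>
<div class="directions" itemprop="instructions">
<h6>Instructions</h6>
<p>Seal the envelop and stamp it</p>
</div>
<p class="copyright">Instruction courtesy of John Doe</p>
</body></html>"""


def instruction_paragraphs(markup):
    """Return the text of the <p> elements that are direct children of the div."""
    root = ET.fromstring(markup)
    div = root.find(".//div[@itemprop='instructions']")
    return ["".join(p.itertext()) for p in div.findall("p")]


paragraphs = instruction_paragraphs(html)
print(paragraphs)
```

This changes which paragraphs are selected, but not the order of the text inside each paragraph.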
