I am using the following HTML:
<html>
<body>
<div class="directions" itemprop="instructions">
<h6>Instructions</h6>
<p>Sharpen your <a href="pencil.html" class="crosslink">pencil</a> (or, alternatively, use your pen)</p>
<p>In a large paper sheet, write your name. When the ink thickens slightly, gently open the <a href="envelop.html" class="crosslink">envelop</a> and insert the <a href="letter.html" class="crosslink" >letter</a> inside folded into 3. Set aside.</p>
<p>Use the pen again to <a href="write.html" class="crosslink">write</a> your name and address into the envelop. Include the destination <a href="address.html" class="crosslink">address</a>.</p>
<p>Seal the envelop and stamp it</p>
<p class="copyright">Instruction courtesy of John Doe</p>
</div>
</body>
</html>
The result I expect is an ordered array of text elements, with the HTML markup ignored:
result=[
'Sharpen your pencil (or, alternatively, use your pen)',
'In a large paper sheet, write your name. When the ink thickens slightly, gently open the envelop and insert the letter inside folded into 3. Set aside',
'Use the pen again to write your name and address into the envelop. Include the destination address',
'Seal the envelop and stamp it'
]
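I am using Python to parse the HTML and extract the pieces of information I need. With tree.xpath('//*[@itemprop="instructions"]') I get the element I want, but I can't seem to get the text out of it in the form I want.
My most recent attempt (still failing) is the following: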
for a in tree.xpath('//*[@itemprop="instructions"]'):
    for i in a.xpath('./p'):
        temptext = ""
        for c in i.xpath('text()'):
            temptext += c
        for c in i.xpath('./a'):
            temptext += c.text
        tempIteration.append(temptext)
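Edit for clarity:
This produces an incorrect result: the text of the "a" nodes ends up in the wrong order. Note that "pencil" appears at the end of element 1 instead of right after "Sharpen your". The same thing happens in the remaining lines.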
result=[
'Sharpen your (or, alternatively, use your pen)pencil',
'In a large paper sheet, write your name. When the ink thickens slightly, gently open the and insert the inside folded into 3. Set asideenvelopletter',
'Use the pen again to your name and address into the envelop. Include the destination writeaddress',
'Seal the envelop and stamp it',
'Instruction Courtesy of John Doe'
]
I haven't been able to get this working, so any help would be greatly appreciated.
You can use the getchildren() method together with each element's text and tail attributes. I have never used lxml, but this is what I gathered from its documentation.
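The text/tail idea can be sketched roughly as follows. This is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (both expose text and tail on elements), it iterates over the element directly rather than calling getchildren(), and it relies on the sample markup being well-formed enough to parse as XML. The helper name paragraph_text and the shortened HTML string are made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the markup from the question. Assumption: it is
# well-formed, so an XML parser can read it; messy real-world HTML would
# need an HTML parser such as lxml.html instead.
HTML = """<html><body>
<div class="directions" itemprop="instructions">
<h6>Instructions</h6>
<p>Sharpen your <a href="pencil.html" class="crosslink">pencil</a> (or, alternatively, use your pen)</p>
<p>Seal the envelop and stamp it</p>
<p class="copyright">Instruction courtesy of John Doe</p>
</div>
</body></html>"""


def paragraph_text(p):
    """Rebuild the text of one <p> in document order (hypothetical helper)."""
    parts = [p.text or ""]              # text before the first child element
    for child in p:                     # direct children, in document order
        parts.append(child.text or "")  # text inside the child, e.g. the <a>
        parts.append(child.tail or "")  # text following the child's end tag
    return "".join(parts)


root = ET.fromstring(HTML)
result = []
for block in root.findall(".//*[@itemprop='instructions']"):
    for p in block.findall("./p"):
        result.append(paragraph_text(p))

result = result[:-1]  # drop the trailing copyright paragraph
print(result)
```

The key point is that the text following a link is stored in the link's tail, not in the paragraph's text, which is why collecting all text nodes first and all link texts afterwards scrambles the order. This sketch only handles one level of nesting; links containing further elements would need a recursive version.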
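You can then use result[:-1] to throw away the last entry.

I'm not sure whether this helps, since my knowledge of XPath is limited, but could it be because you did not close the <div class="directions" itemprop="instructions"> element? Shouldn't the markup end with a closing tag? Note that I added </div>.
Hope this helps :)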
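Independently of whether the div is closed, another possible way to get the text in document order is itertext(), which yields an element's text and the text of its descendants in the order they appear. The sketch below is an assumption-laden illustration, not the asker's code: it again uses xml.etree.ElementTree in place of lxml (lxml elements also have itertext()), a shortened and properly closed copy of the markup, and a filter on class="copyright" instead of slicing off the last item.

```python
import xml.etree.ElementTree as ET

# Shortened, properly closed copy of the markup (assumption: parses as XML).
HTML = """<div class="directions" itemprop="instructions">
<h6>Instructions</h6>
<p>Use the pen again to <a href="write.html" class="crosslink">write</a> your name. Include the destination <a href="address.html" class="crosslink">address</a>.</p>
<p class="copyright">Instruction courtesy of John Doe</p>
</div>"""

root = ET.fromstring(HTML)  # root is the <div> itself

# Join every text fragment of each <p> in document order,
# skipping the paragraph marked as the copyright notice.
steps = [
    "".join(p.itertext())
    for p in root.findall("./p")
    if p.get("class") != "copyright"
]
print(steps)
```

Filtering on the class attribute avoids depending on the copyright paragraph always being the last one.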