How to separate tags from an HTML tree

Asked 2025-04-17 09:51

This is my HTML structure:

 <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>

From this HTML, I need to extract the content that comes before each <br> tag:

Line 1: Get the IndianOil Citibank Card. Apply Now!

Line 2: Get 10X Rewards On Shopping - Save Over 5% On Fuel

How can I do this in Python?

4 Answers

1

This solution does not rely on the <br> tags:

import lxml.html

html = "..."
tree = lxml.html.fromstring(html)
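# Line 1: the first three text pieces, in document order, taken from the <li>'s
# own text nodes and from the <b> child of the parsed element.
line1 = ''.join(tree.xpath('//li[@class="taf"]/text() | b/text()')[:3]).strip()
# Line 2: the text of every link inside the <li> that has no id attribute.
line2 = ' - '.join(tree.xpath('//li[@class="taf"]//a[not(@id)]/text()'))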
1

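When parsing HTML, we have to make some assumptions about how it is laid out. If we can assume that the line preceding a <br> tag is everything before it, back to the nearest block-level tag or another <br> tag, then we can do the following.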

# BeautifulSoup 4 (the beautifulsoup4 package); the old BeautifulSoup 3 module is obsolete.
from bs4 import BeautifulSoup

doc = """
   <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
    </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
    <br />
    <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
    <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
    <br />
    <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(doc, "html.parser")

Now that the HTML is parsed, we define a list of tags that should not count as part of a line. There are other block-level tags, but these are enough for this HTML.

block_tags = ["div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"]

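We loop over every <br> tag and step backwards through its siblings until there are no more nodes or we reach a block-level tag. On each iteration we prepend the node to the line. NavigableString nodes have no tag name, but we still want to include them, which is why the while condition has two parts.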

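for node in soup.findAll("br"):
    line = ""
    sibling = node.previousSibling
    # Walk backwards until we run out of siblings or reach a block-level tag.
    while sibling is not None and (not hasattr(sibling, "name") or sibling.name not in block_tags):
        # Prepend, because we are moving from right to left.
        line = str(sibling) + line
        sibling = sibling.previousSibling
    print(line)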
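
For readers who cannot install a third-party parser, the same boundary rule can be sketched with the standard library's html.parser. This is a swapped-in technique, not the code from this answer: it streams through the markup instead of walking siblings, and it returns plain text with whitespace collapsed rather than the markup of each line. The names BrLineCollector and lines_before_br are made up for this illustration, and BLOCK_TAGS simply mirrors block_tags above without "br".

```python
from html.parser import HTMLParser

# Assumption: same block-level tags as block_tags above, minus "br",
# which is handled separately because it is what ends a line.
BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "h4", "h5", "h6"}


class BrLineCollector(HTMLParser):
    """Collect the text that precedes each <br>, back to the last block tag or <br>."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines, one per <br>
        self._buffer = []  # text seen since the last boundary

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # A <br> closes the current line; collapse whitespace and keep it.
            self.lines.append(" ".join("".join(self._buffer).split()))
            self._buffer = []
        elif tag in BLOCK_TAGS:
            # An opening block-level tag starts a fresh line.
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            # Text inside a closed block (here the <h3>) is not part of the line.
            self._buffer = []

    def handle_data(self, data):
        self._buffer.append(data)


def lines_before_br(html):
    parser = BrLineCollector()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = """
 <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""

for line in lines_before_br(sample):
    print(line)
```

On the sample from the question this prints the two requested lines. Inline tags such as <a> and <b> are ignored on purpose, so their text stays in the line while the tags themselves are dropped.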
4

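I think what you are asking for is the line before each <br/> tag.

The code below handles the sample you provided. It strips the <b> and <a> tags (keeping their text) and prints the .tail of every element that is followed by a <br/> sibling: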

from lxml import etree

doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")

etree.strip_tags(doc,'a','b')

# Every element that has a <br> among its following siblings; its tail is the line text.
for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(repr(element.tail.strip()))

Output:

'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\n   Save Over 5% On Fuel'
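
The second string in the output above still carries the newline and indentation from the source markup. If a single-line result is wanted, as in the question, one option is to collapse every run of whitespace into one space. A minimal sketch; collapse_whitespace is a hypothetical helper name and not part of lxml:

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


raw = 'Get 10X Rewards On Shopping -\n   Save Over 5% On Fuel'
print(collapse_whitespace(raw))
```

This prints Get 10X Rewards On Shopping - Save Over 5% On Fuel on one line. It could be applied to element.tail in place of .strip(), since it also removes leading and trailing whitespace.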
