How to remove HTML tags in BeautifulSoup when I already have the contents

Posted 2024-04-19 23:34:24

The HTML I am trying to scrape:

<div id="unitType"> <h2>BB100 <br>v1.4.3</h2> </div>

I get the contents of the h2 tag with the following code:

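# Python 2 code; `url` is assumed to be defined elsewhere
import urllib
from bs4 import BeautifulSoup

initialPage = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')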
deviceInfo = initialPage.find('div', {'id': 'unitType'}).h2.contents
print('Device Info: ', deviceInfo)
for i in deviceInfo:
    print i

Which outputs:

('Device Info: ', [u'BB100 ', <br>v1.4.3</br>])
BB100
<br>v1.4.3</br>

How can I remove the <h2></h2> and <br></br> HTML tags using BeautifulSoup rather than a regex? I tried i.decompose() and i.strip(), but neither worked. It throws 'NoneType' object is not callable.


2 Answers

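Simply use find and extract to remove the br tag. Note that in the parse shown in the question the text v1.4.3 is nested inside the br tag, so extracting br removes that text as well: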

In [15]: from bs4 import BeautifulSoup
    ...: 
    ...: h = """<div id='unitType'><h2>BB10<br>v1.4.3</h2></div>"""
    ...: 
    ...: soup = BeautifulSoup(h, "html.parser")
    ...: 
    ...: h2 = soup.find(id="unitType").h2
    ...: h2.find("br").extract()
    ...: print(h2)
    ...: 
<h2>BB10</h2>

Or use replace_with to replace the tag with its text:

In [16]: from bs4 import BeautifulSoup
    ...: 
    ...: h = """<div id='unitType'><h2><br>v1.4.3 BB10</h2></div>"""
    ...: 
    ...: soup = BeautifulSoup(h, "html.parser")
    ...: 
    ...: h2 = soup.find(id="unitType").h2
    ...: 
    ...: br = h2.find("br")
    ...: br.replace_with(br.text)
    ...: print(h2)
    ...: 
<h2>v1.4.3 BB10</h2>

To remove the h2 and keep the text:

In [37]: h = """<div id='unitType'><h2><br>v1.4.3 BB10</h2></div>"""
    ...: 
    ...: soup = BeautifulSoup(h, "html.parser")
    ...: 
    ...: unit = soup.find(id="unitType")
    ...: 
    ...: h2 = unit.find("h2")
    ...: h2.replace_with(h2.text)
    ...: print(unit)
    ...: 
<div id="unitType">v1.4.3 BB10</div>
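
The tag removal shown above can also be sketched with unwrap(), which drops a tag but leaves its children where they were. This is only a minimal sketch: the helper name strip_tags is hypothetical, and the sample markup is the snippet from the question.

```python
from bs4 import BeautifulSoup


def strip_tags(html, names):
    """Remove the named tags from html but keep whatever they contain.

    strip_tags is a hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(names):
        # unwrap() replaces the tag with its own children
        tag.unwrap()
    return str(soup)


sample = '<div id="unitType"><h2>BB100 <br>v1.4.3</h2></div>'
cleaned = strip_tags(sample, ["h2", "br"])
print(cleaned)
```

Because unwrap() keeps the children, this should give the same text whether the parser nests v1.4.3 inside the br tag or treats br as an empty element.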

If you just want "BB100" and "v1.4.3", there are many ways to get at them:

In [60]: h = """<div id="unitType">
    ...:      <h2>BB100 <br>v1.4.3</h2>
    ...:  </div>"""
    ...: 
    ...: soup = BeautifulSoup(h, "html.parser")
    ...: 
    ...: h2 = soup.find(id="unitType").h2
    ...: # just find all strings
    ...: a, b = h2.find_all(text=True)
    ...: print(a, b)
    ...: # get the br
    ...: br = h2.find("br")
    ...: # get br text and just the h2 text ignoring any text from children
    ...: a, b = h2.find(text=True, recursive=False), br.text
    ...: print(a, b)
    ...: 
BB100  v1.4.3
BB100  v1.4.3
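
Two more ways to pull out just the strings, as a rough sketch: stripped_strings and get_text() are standard BeautifulSoup features, while the variable names here are only illustrative. Neither depends on whether the text ends up inside the br tag.

```python
from bs4 import BeautifulSoup

html = """<div id="unitType">
    <h2>BB100 <br>v1.4.3</h2>
</div>"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find(id="unitType").h2

# stripped_strings yields every text node with surrounding whitespace removed
parts = list(h2.stripped_strings)
print(parts)

# get_text() joins the text nodes; here each piece is stripped and joined by a space
joined = h2.get_text(" ", strip=True)
print(joined)
```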

Why you end up with the text

You can check whether an element is a <br> tag with if i.name == 'br', and then simply change the list to hold its contents instead.

for i in deviceInfo:
    if i.name == 'br':
        i = i.contents

If you need to iterate over it more than once, modify the list itself.

for n, i in enumerate(deviceInfo):
    if i.name == 'br':
        i = i.contents
        deviceInfo[n] = i
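
An alternative to patching the list in place is to build a new list of plain strings. The following is a minimal sketch, assuming that only the text matters; contents_as_text is a hypothetical helper name, and the isinstance check is used here instead of comparing i.name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def contents_as_text(tag):
    """Return the direct children of tag as stripped strings, skipping empty ones.

    contents_as_text is a hypothetical helper, not part of BeautifulSoup.
    """
    texts = []
    for child in tag.contents:
        # Tag children (such as <br>) contribute their inner text;
        # string children are used as they are.
        text = child.get_text() if isinstance(child, Tag) else str(child)
        text = text.strip()
        if text:
            texts.append(text)
    return texts


soup = BeautifulSoup('<div id="unitType"><h2>BB100 <br>v1.4.3</h2></div>', "html.parser")
device_info = contents_as_text(soup.find("div", {"id": "unitType"}).h2)
print(device_info)
```

Calling strip() on the resulting strings is safe because they are ordinary Python strings; the error in the question appears to arise because i.strip on a Tag is treated as a search for a child tag named strip, which returns None.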
