How to remove HTML tags in Python

Posted 2024-06-11 17:36:05


[<li style="text-align: left;">
<span style="line-height: 19px;">
For Female/SC/ST/ PH: <strong>NIL</strong></span></li>,
<li style="text-align: left;">
<span style="line-height: 19px;">For Others:
<strong>Rs. 200/-</strong></span></li>,
<li style="text-align: left;">
Candidates can pay either by depositing the money in any Branch 
of SBI by cash or by using net banking facility of SBI.</li>]

The expected result is:

For Female/SC/ST/ PH:NIL,For Others:
Rs. 200/-,    Candidates can pay either by depositing the money in any Branch 
of SBI by cash or by using net banking facility of SBI.

In Python, how can I remove all the tags from the string above?


2 Answers

Try this:

from bs4 import BeautifulSoup


# Triple quotes let the string contain double quotes and span several lines.
html = """<li style="text-align: left;">
<span style="line-height: 19px;">
For Female/SC/ST/ PH: <strong>NIL</strong></span></li>,
<li style="text-align: left;">
<span style="line-height: 19px;">For Others:
<strong>Rs. 200/-</strong></span></li>,
<li style="text-align: left;">
Candidates can pay either by depositing the money in any Branch 
of SBI by cash or by using net banking facility of SBI.</li>"""

# Parse the markup, then keep only the text nodes.
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
print(text)
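
If installing BeautifulSoup is not an option, the standard library's html.parser module can do a similar job. The following is only a minimal sketch of that swapped-in technique, not part of the answer above: TextExtractor and strip_tags are hypothetical names made up for illustration, and collapsing all whitespace into single spaces is an assumption about the output you want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Assumption: collapse newlines and repeated spaces into single spaces.
    return " ".join("".join(parser.chunks).split())


sample = ('<li style="text-align: left;">\n'
          '<span style="line-height: 19px;">For Others:\n'
          '<strong>Rs. 200/-</strong></span></li>')
print(strip_tags(sample))  # For Others: Rs. 200/-
```

Because a real parser does the tokenizing, this approach should cope with tags that span several lines and with character references such as &amp;.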

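Many HTML parsing libraries can do this, BeautifulSoup among them. Another option (I would still recommend BeautifulSoup; see Saikrishna Rajaraman's answer) is a regular expression with re.sub(), where s is the input string: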

re.sub(r'<.*?>', '', s)

This produces:

For Female/SC/ST/ PH: NIL,

For Others:
Rs. 200/-,

Candidates can pay either by depositing the money in any Branch 
of SBI by cash or by using net banking facility of SBI.
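
For reference, here is the regex approach as a small runnable sketch. The function name strip_tags_regex is hypothetical, and the re.DOTALL flag, the html.unescape() call and the whitespace cleanup are additions beyond the one-liner above, included on the assumption that tags may span several lines and that entities such as &amp; should be decoded.

```python
import html
import re

# DOTALL lets "." match newlines, so a tag split across lines is still removed.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_tags_regex(s):
    no_tags = TAG_RE.sub("", s)        # drop everything between < and >
    decoded = html.unescape(no_tags)   # turn &amp; into &, &lt; into <, etc.
    # Assumption: collapse newlines and repeated spaces into single spaces.
    return " ".join(decoded.split())


s = ('<li style="text-align: left;">\n'
     '<span style="line-height: 19px;">\n'
     'For Female/SC/ST/ PH: <strong>NIL</strong></span></li>')
print(strip_tags_regex(s))  # For Female/SC/ST/ PH: NIL
```

Keep in mind that a regex does not understand HTML: a literal < or > in the text or inside an attribute value can make it remove too much, which is one reason a parser is usually the safer choice.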

If your HTML happens to be stored in a list, you can do the following (note the conversion to str):

[re.sub(r'<.*?>', '', str(s)) for s in myList]
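
To get from such a list to a single comma-separated string like the one the question asks for, something along these lines might work. It is a sketch under assumptions: my_list is a hypothetical stand-in holding plain strings (str() also covers objects such as bs4 Tag elements, whose string form is their markup), and the ", " separator and whitespace cleanup are guesses at the desired format.

```python
import re

# Hypothetical stand-in for myList; the real list may hold bs4 Tag objects.
my_list = [
    '<li style="text-align: left;">\n<span style="line-height: 19px;">\n'
    'For Female/SC/ST/ PH: <strong>NIL</strong></span></li>',
    '<li style="text-align: left;">\n<span style="line-height: 19px;">'
    'For Others:\n<strong>Rs. 200/-</strong></span></li>',
]

# Strip the tags from each item, then collapse whitespace within the item.
cleaned = [" ".join(re.sub(r"<.*?>", "", str(s)).split()) for s in my_list]

# Join the cleaned items into one string.
result = ", ".join(cleaned)
print(result)  # For Female/SC/ST/ PH: NIL, For Others: Rs. 200/-
```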
