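How to remove [font] tags in Python using html2text/BeautifulSoup
I'm using BeautifulSoup to fetch results from my website, and this is a chunk of the markup, which contains a lot of tags: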
<span style="color: blue;"><span style="color: blue;">[font='Times New Roman']<span style="font-size: 22pt;">THIS</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> IS </span>[/font]<span style="color: #FF3300;"><span style="color: #FF3300;">[font='Times New Roman']<span style="font-size: 22pt;">A TEST</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> USING </span>[/font]<span style="color: #00CC66;"><span style="color: #00CC66;">[font='Times New Roman']<span style="font-size: 22pt;">SOME</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> BEAUTIFUL </span>[/font]<span style="color: fuchsia;"><span style="color: fuchsia;">[font='Times New Roman']<span style="font-size: 22pt;">SOUP</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> | </span>[/font]<span style="color: #00CCFF;"><span style="color: #00CCFF;">[font='Times New Roman']<span style="font-size: 22pt;">96786</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> AND </span>[/font]<span style="color: #CC33FF;"><span style="color: #CC33FF;">[font='Times New Roman']<span style="font-size: 22pt;">HTML2TEXT</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> TO LEARN </span>[/font]<span style="color: red;"><span style="color: red;">[font='Times New Roman']<span style="font-size: 22pt;">NEW THING</span>[/font]</span></span>
I then use html2text to extract the raw text from that chunk, like this:
import html2text

h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True
h.ignore_emphasis = True
print(h.handle(content))  # content is that chunk of code
The best result I've gotten so far is:
[font='Times New Roman']THIS[/font][font='Times New Roman'] THIS
[/font][font='Times New Roman']IS[/font][font='Times New
Roman'] A TEST [/font][font='Times New Roman']USING[/font][font='Times New
Roman'] BEAUTIFUL [/font][font='Times New Roman'] SOUP [/font][font='Times New Roman']
| [/font][font='Times New Roman']96786[/font][font='Times New Roman'] AND [/font][font='Times New Roman'] HTML2TEXT [/font][font='Times New Roman'] TO LEARN [/font][font='Times New Roman']NEW THING[/font]
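How can I get rid of the [font] tags using html2text and BeautifulSoup, or some other method? Thanks!
My current approach is string replacement, replacing [font ...] and [/font] with empty strings, but that doesn't seem very efficient. Is there a better solution?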
1 Answer
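It looks like your input is a mix of HTML and BBCode. BeautifulSoup and html2text are both built to parse HTML and extract text from it; neither supports BBCode.
A simple fix is to convert the [font] BBCode "tags" to HTML before handing the input to BeautifulSoup or html2text. You can do the conversion with regular expressions; the convert_bbcode_fonts function below is an example. (Note that this doesn't turn your input into "valid" HTML4 font tags, but html2text can still handle it.)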
import re
import html2text


def convert_bbcode_fonts(html):
    """Rewrite BBCode [font ...] / [/font] markers as HTML-style <font> tags."""
    flags = re.IGNORECASE | re.MULTILINE
    # replace start font tags; the raw string keeps \1 as a backreference
    html = re.sub(r'\[font\s*([^\]]+)\]', r'<font \1>', html, flags=flags)
    # replace end font tags
    html = re.sub(r'\[/font\s*\]', '</font>', html, flags=flags)
    return html


def extract_text(html):
    """Convert the BBCode font tags, then let html2text extract plain text."""
    html = convert_bbcode_fonts(html)
    h = html2text.HTML2Text()
    h.ignore_links = True
    h.ignore_images = True
    h.ignore_emphasis = True
    return h.handle(html)

INPUT = """
<span style="color: blue;"><span style="color: blue;">[font='Times New Roman']<span style="font-size: 22pt;">THIS</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> IS </span>[/font]<span style="color: #FF3300;"><span style="color: #FF3300;">[font='Times New Roman']<span style="font-size: 22pt;">A TEST</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> USING </span>[/font]<span style="color: #00CC66;"><span style="color: #00CC66;">[font='Times New Roman']<span style="font-size: 22pt;">SOME</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> BEAUTIFUL </span>[/font]<span style="color: fuchsia;"><span style="color: fuchsia;">[font='Times New Roman']<span style="font-size: 22pt;">SOUP</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> | </span>[/font]<span style="color: #00CCFF;"><span style="color: #00CCFF;">[font='Times New Roman']<span style="font-size: 22pt;">96786</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> AND </span>[/font]<span style="color: #CC33FF;"><span style="color: #CC33FF;">[font='Times New Roman']<span style="font-size: 22pt;">HTML2TEXT</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> TO LEARN </span>[/font]<span style="color: red;"><span style="color: red;">[font='Times New Roman']<span style="font-size: 22pt;">NEW THING</span>[/font]</span></span>
"""

if __name__ == '__main__':
    print(extract_text(INPUT))
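
If you don't need the font information at all, a variation on the same idea is to delete the BBCode markers outright instead of converting them. The following is only a minimal sketch using the standard re module: strip_bbcode_fonts and FONT_TAG are hypothetical names made up for illustration, and the pattern assumes a font tag never contains a literal ']' inside its attribute value.

```python
import re

# Matches an opening tag such as [font='Times New Roman'] or a closing [/font].
# Assumption: the attribute value never contains a literal ']'.
FONT_TAG = re.compile(r"\[/?font\b[^\]]*\]", re.IGNORECASE)


def strip_bbcode_fonts(text):
    """Remove [font ...] and [/font] markers, keeping the text between them."""
    return FONT_TAG.sub("", text)


sample = "[font='Times New Roman']<span style=\"font-size: 22pt;\">THIS</span>[/font]"
print(strip_bbcode_fonts(sample))
```

The cleaned string can then be handed to h.handle() in the same way as before. Running the substitution before html2text, rather than on its output, should also sidestep the case where line wrapping splits a tag such as [font='Times New Roman'] across two lines.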