Beautiful Soup - How to fix broken tags

2 votes
2 answers
2153 views
Asked 2025-04-17 02:48

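I would like to know how to fix broken HTML tags before parsing the HTML with Beautiful Soup.

In the script below, td> needs to be replaced with <td.

How can I do this substitution so that Beautiful Soup recognizes the tag?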

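from bs4 import BeautifulSoup

s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

a = BeautifulSoup(s, "html.parser")

left = []
right = []

# Each row is expected to hold exactly two cells: a label and an input.
# The first row's opening tag is missing its "<", which is the problem.
for tr in a.find_all('tr'):
    l, r = tr.find_all('td')
    left.extend(l.find_all(string=True))
    right.extend(r.find_all(string=True))

print(left + right)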

2 Answers

2

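If you only care about the td> -> <td> problem, try:

import re

# Only touch a "td>" that is not already preceded by "<" or "/",
# so intact <td> and </td> tags are left alone.
myString = re.sub(r'(?<![</])td>', '<td>', myString)

before passing myString to BeautifulSoup. If other tags are broken as well, give us some examples and we can work through them :)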
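A bare 'td>' pattern also occurs inside intact <td> and </td> tags, so it is worth comparing it with a guarded version before relying on either. The following is only a small sketch: the variable names are illustrative, and the negative lookbehind is one possible guard rather than the only one.

```python
import re

s = "<tr>\ntd>LABEL1</td><td>INPUT1</td>\n</tr>"

# Unguarded: every "td>" is rewritten, including the ones inside good tags.
naive = re.sub(r'td>', '<td>', s)

# Guarded: skip any "td>" that is directly preceded by "<" or "/".
guarded = re.sub(r'(?<![</])td>', '<td>', s)

print(repr(naive))
print(repr(guarded))
```

The guarded result is the string that would then be handed to BeautifulSoup.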

2

Edit (working):

I found a complete (or at least it should be complete) list of HTML tags on the W3 site to match against. You can try:

import re

# re.VERBOSE makes the regex ignore the whitespace used to lay out the
# tag list, so the indentation does not leak into the pattern.
fixedString = re.sub(r""">\s*(!--|!DOCTYPE|
                           a|abbr|acronym|address|applet|area|
                           b|base|basefont|bdo|big|blockquote|body|br|button|
                           caption|center|cite|code|col|colgroup|
                           dd|del|dfn|dir|div|dl|dt|
                           em|
                           fieldset|font|form|frame|frameset|
                           head|h1|h2|h3|h4|h5|h6|hr|html|
                           i|iframe|img|input|ins|
                           kbd|
                           label|legend|li|link|
                           map|menu|meta|
                           noframes|noscript|
                           object|ol|optgroup|option|
                           p|param|pre|
                           q|
                           s|samp|script|select|small|span|strike|strong|style|sub|sup|
                           table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|
                           u|ul|
                           var)>""", r"><\g<1>>", s, flags=re.VERBOSE)
bs = BeautifulSoup(fixedString)

The result:

>>> print(s)

<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>

>>> print(fixedString)

<tr><td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>
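The same substitution can be sketched with the alternation built from a Python list, which avoids laying out one long pattern by hand. This is only an illustration: TAGS is a small made-up subset of the full tag list, and repair_open_tags is a hypothetical helper name.

```python
import re

# Illustrative subset of tag names; extend it with the full list as needed.
TAGS = ["table", "tbody", "td", "th", "thead", "tr", "p", "div", "span"]

# Join the escaped names into one alternation: (table|tbody|td|...)
ALTERNATION = "|".join(re.escape(tag) for tag in TAGS)

# A ">" (end of the previous tag), optional whitespace, then a bare "name>".
OPEN_TAG = re.compile(r">\s*(%s)>" % ALTERNATION)

def repair_open_tags(markup):
    # Re-insert the missing "<"; the whitespace between the tags is dropped.
    return OPEN_TAG.sub(r"><\g<1>>", markup)

s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

print(repair_open_tags(s))
```

Because the pattern requires a preceding ">", a broken tag that follows plain text or sits at the very start of the string is not repaired by this sketch.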

This one should also match broken closing tags (e.g. /endtag>):

# Group 1 captures an optional "/" so broken closing tags are repaired too.
fixedString = re.sub(r""">\s*(/?)(!--|!DOCTYPE|
                 a|abbr|acronym|address|applet|area|
                 b|base|basefont|bdo|big|blockquote|body|br|button|
                 caption|center|cite|code|col|colgroup|
                 dd|del|dfn|dir|div|dl|dt|
                 em|
                 fieldset|font|form|frame|frameset|
                 head|h1|h2|h3|h4|h5|h6|hr|html|
                 i|iframe|img|input|ins|
                 kbd|
                 label|legend|li|link|
                 map|menu|meta|
                 noframes|noscript|
                 object|ol|optgroup|option|
                 p|param|pre|
                 q|
                 s|samp|script|select|small|span|strike|strong|style|sub|sup|
                 table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|
                 u|ul|
                 var)>""", r"><\g<1>\g<2>>", s, flags=re.VERBOSE)
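One way to sanity-check a repair like this before handing the string to Beautiful Soup is to count the start and end tags that a parser actually sees. The sketch below uses only the standard library's html.parser for the check. TAGS is a made-up subset, and repair_tags, TagCounter and count_tags are hypothetical names introduced for illustration.

```python
import re
from html.parser import HTMLParser

TAGS = ["table", "tr", "td", "th"]  # illustrative subset of the tag list
PATTERN = re.compile(r">\s*(/?)(%s)>" % "|".join(TAGS))

def repair_tags(markup):
    # Restore the missing "<" on bare opening ("td>") and closing ("/tr>") tags.
    return PATTERN.sub(r"><\g<1>\g<2>>", markup)

class TagCounter(HTMLParser):
    """Records the names of the start and end tags the parser recognizes."""
    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)

def count_tags(markup):
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.opened, counter.closed

broken = "<tr>\ntd>LABEL1</td>\n/tr>"
fixed = repair_tags(broken)
opened, closed = count_tags(fixed)

# After the repair, every opened tag should have a matching close.
print(fixed, sorted(opened) == sorted(closed))
```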
