How can I make BeautifulSoup parse the contents of a textarea tag as HTML?

5 votes

2 answers

6083 views

Asked 2025-04-15 21:46

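Before version 3.0.5, BeautifulSoup treated the contents of <textarea> as HTML. Now it treats them as plain text. The document I am parsing has HTML inside its textarea tags, and I want to process that HTML.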

I tried:

    for textarea in soup.findAll('textarea'):
        contents = BeautifulSoup.BeautifulSoup(textarea.contents)
        textarea.replaceWith(contents.html(text=True))

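But I get an error. I can't find anything relevant in the documentation, and the other parsers didn't help. Does anyone know how I can parse the contents of a textarea as HTML?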

Edit:

Here is some sample HTML:

<textarea class="ks-lazyload-custom">
  <div class="product-view product-view-rug">
    Foobar Womble
    <div class="product-view-head">
      <img src="tps/i1/fo-25.gif" />
    </div>
  </div>
</textarea>

The error message is:

File "D:\src\cross\tserver\src\tools\sitecrawl\BeautifulSoup.py", line 1913, in _detectEncoding
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer

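I am looking for a way to take the contents of an element, parse them with BeautifulSoup, and put the parsed text back into the original element (or replace the element entirely).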

How real-world markup relates to the spec isn't really the point here. The data needs to be parsed, and I'm looking for a way to do it.


2 Answers

I am currently using the following code, which works most of the time. Your mileage may vary.

# Python 2 / BeautifulSoup 3; a method of a larger class (needs `import BeautifulSoup`).
def _extractText(self, data, encoding):
    if self.isDebug: self._output("_extractText")
    soup = BeautifulSoup.BeautifulSoup(data, fromEncoding=encoding)
    # Drop comments, scripts and stylesheets so only visible text remains.
    comments = soup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
    for comment in comments:
        comment.extract()
    for script in soup.findAll('script'):
        script.extract()
    for css in soup.findAll('style'):
        css.extract()
    # Re-parse each textarea's raw contents as HTML and keep only the extracted text.
    for textarea in soup.findAll('textarea'):
        textarea.string = self._extractText(textarea.renderContents(), 'UTF-8')
    # Collect the remaining text nodes, one per line, skipping blanks and the doctype.
    text = unicode('')
    for line in soup.findAll(text=True):
        line = line.replace('&nbsp;', ' ').strip()
        if line == '': continue
        if line.startswith('doctype'): continue
        if line.startswith('DOCTYPE'): continue
        text = text + line + '\n'
    return text
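
The code above targets BeautifulSoup 3 on Python 2. For anyone on the newer bs4 package, here is a minimal sketch of the same re-parse step, not a drop-in replacement. `expand_textareas` and `SAMPLE` are names made up for illustration, and the sketch assumes bs4 is installed and that its `html.parser` builder is acceptable. Whether a parser hands you the textarea contents as text or as child tags depends on the parser and its version, so the sketch checks before re-parsing.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup taken from the question.
SAMPLE = """<textarea class="ks-lazyload-custom">
  <div class="product-view product-view-rug">
    Foobar Womble
    <div class="product-view-head">
      <img src="tps/i1/fo-25.gif" />
    </div>
  </div>
</textarea>"""


def expand_textareas(soup, parser="html.parser"):
    """Re-parse the text of every <textarea> and put the resulting nodes inside it.

    Hypothetical helper, not part of bs4.
    """
    for textarea in soup.find_all("textarea"):
        # Some parsers already turn the contents into child tags; leave those alone.
        if any(isinstance(child, Tag) for child in textarea.contents):
            continue
        # get_text() returns the raw markup when the contents were kept as text.
        fragment = BeautifulSoup(textarea.get_text(), parser)
        textarea.clear()
        # Copy the list first: append() moves each node out of `fragment`.
        for child in list(fragment.contents):
            textarea.append(child)
    return soup


soup = expand_textareas(BeautifulSoup(SAMPLE, "html.parser"))
print(soup.find("img")["src"])
```

After the call, the nested `div` and `img` elements can be searched like any other tags. To drop the textarea wrapper as well, calling `textarea.unwrap()` after the loop body should do it.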

Answered 2025-04-15 by Python大师


This seems to work well (if I've understood what you want):

for textarea in soup.findAll('textarea'):
    contents = BeautifulSoup.BeautifulSoup(textarea.contents[0]).renderContents()
    textarea.replaceWith(contents)
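
A likely reason the original attempt failed: `textarea.contents` is a list of nodes rather than a string, and the traceback shows `_detectEncoding` calling a regex `.match()` on whatever it was handed. Indexing with `[0]` passes the single text node instead. The standard-library sketch below reproduces only that type error, using the pattern from the traceback. `ENCODING_DECL`, `try_match` and the sample string are made up for illustration, and on Python 3 the message is worded slightly differently from the Python 2 one quoted in the question.

```python
import re

# Pattern from the traceback in the question, written as a raw string.
ENCODING_DECL = re.compile(r'''^<\?.*encoding=['"](.*?)['"].*\?>''')


def try_match(xml_data):
    """Return the declared encoding, None if there is none, or the TypeError text."""
    try:
        found = ENCODING_DECL.match(xml_data)
    except TypeError as exc:
        return "TypeError: %s" % exc
    return found.group(1) if found else None


# Hypothetical stand-in for textarea.contents: a list holding one string.
contents = ['<?xml version="1.0" encoding="utf-8"?><div>Foobar Womble</div>']

print(try_match(contents))     # a list: the regex refuses it with a TypeError
print(try_match(contents[0]))  # a single string: the regex can run
```

Note that `contents[0]` only covers the first child, so a textarea that ends up with several child nodes would need them joined first.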

Answered 2025-04-15 by Python大师
