Rendering HTML as plain text with Python

<div>
  <p>
    Some text
    <span>more text</span>
    even more text
  </p>
  <ul>
    <li>list item</li>
    <li>yet another list item</li>
  </ul>
</div>
<p>Some other text</p>
<ul>
  <li>list item</li>
  <li>yet another list item</li>
</ul>

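import re
import BeautifulSoup  # BeautifulSoup 3 API, as used in the question

def parse_text(contents_string):
    # Collapse a line break plus any following whitespace into one newline
    Newlines = re.compile(r'[\r\n]\s+')
    bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # Join the text of every node with a newline
    txt = bs.getText('\n')
    return Newlines.sub('\n', txt)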

2 answers

User

Answer 1 · edited 2024-05-13 20:47:07

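I ran into the same problem when trying to parse rendered HTML. It basically looks like BeautifulSoup is not the ideal package for this. @Del provided the great html2text solution.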

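On a different question, "BeautifulSoup get_text does not strip all tags and JavaScript", @Helge mentioned using nltk. Unfortunately, nltk appears to be dropping this function: in NLTK 3, `nltk.clean_html` no longer strips markup and raises `NotImplementedError` instead.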

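I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they deserved an answer for posterity. Of course, speed depends heavily on the content of the data.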

@Helge's answer (nltk):

import nltk

# IPython magic; `html` holds the page source
%timeit nltk.clean_html(html)
# returned: 153 us per loop

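It worked really well, returning a string with the rendered HTML. In this test the nltk function was even faster than html2text, although html2text is probably more robust.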

The answer from @del:

import html2text

# `html` is a byte string here; undecodable bytes are dropped
betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
# returned: 3.09 ms per loop
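
The `%timeit` lines above are IPython magics. Outside IPython, a similar measurement can be sketched with the standard-library `timeit` module. In this sketch, `strip_tags` and `time_per_call` are hypothetical names; `strip_tags` is a naive regex stand-in (neither nltk nor html2text) so the example runs without third-party packages. Swap in `html2text.html2text` or another converter to compare real libraries.

```python
import re
import timeit

# Hypothetical stand-in converter: a naive regex tag stripper.
# It is NOT a robust HTML-to-text converter; it only gives timeit something to call.
TAG = re.compile(r"<[^>]+>")

def strip_tags(html):
    return TAG.sub("", html)

def time_per_call(func, arg, number=1000, repeat=5):
    """Return the best observed time per call, in seconds."""
    timer = timeit.Timer(lambda: func(arg))
    # repeat() returns one total time per round; take the fastest round
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number

sample = "<p>Some text <span>more text</span> even more text</p>"
seconds = time_per_call(strip_tags, sample)
print("strip_tags: %.2f us per call" % (seconds * 1e6))
```

Taking the minimum over several rounds is the usual way to reduce noise from other processes. As the answer notes, the numbers depend heavily on the input document.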

User

Answer 2 · edited 2024-05-13 20:47:07

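BeautifulSoup is a scraping library, so it is probably not the best choice for rendering HTML. If using BeautifulSoup is not essential, you should take a look at html2text. For example: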

import html2text

# Read the HTML file and print its plain-text rendering
with open("foobar.html") as f:
    html = f.read()
print(html2text.html2text(html))

This will output:

Some text more text even more text

  * list item
  * yet another list item

Some other text

  * list item
  * yet another list item
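
If installing html2text is not an option, the general idea (inline elements such as `<span>` stay on the current line, block elements start a new one) can be sketched with the standard library's `html.parser`. This is a rough illustration, not html2text's actual implementation. The names `TextRenderer` and `render`, the set of block tags, and the `  * ` bullet prefix are assumptions chosen to resemble the output above, and the sketch only handles flat lists.

```python
import re
from html.parser import HTMLParser

# Assumed minimal set of block-level tags; real HTML has many more.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li"}

class TextRenderer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []    # finished output lines
        self.current = []  # text fragments of the line being built
        self.prefix = ""   # bullet prefix for the line being built

    def _flush(self):
        # Collapse runs of whitespace and emit the line if it has any text
        text = re.sub(r"\s+", " ", "".join(self.current)).strip()
        if text:
            self.lines.append(self.prefix + text)
        self.current = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        # Only block tags break the line; inline tags such as <span> do not
        if tag in BLOCK_TAGS:
            self._flush()
            if tag == "li":
                self.prefix = "  * "

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self.current.append(data)

def render(html):
    parser = TextRenderer()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return "\n".join(parser.lines)

print(render("<p>Some text <span>more text</span> even more text</p>"
             "<ul><li>list item</li><li>yet another list item</li></ul>"))
```

Unlike html2text, this sketch does not insert blank lines between blocks, and it ignores links, headings, tables, and nested lists.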
