How can I extract readable text from HTML with Python?

4 votes

4 answers

5090 views

Asked 2025-04-16 00:46

I know of tools such as html2text and BeautifulSoup, but they also extract the JavaScript code on the page and mix it into the text, which makes the two hard to separate.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternatively:

from stripogram import html2text
extract = html2text(webPage)

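Both approaches pull in all of the JavaScript code on the page, which is not what I want.

I only want the readable text, the text you could copy from the page in a browser.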

web-scraping html-parsing beautifulsoup readable-text-extraction html2text javascript-filtering

4 Answers

In Beautiful Soup you can remove the script tags like this:

for script in soup("script"):
    script.extract()

See "Removing elements" in the Beautiful Soup documentation.
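To put the loop above in context, here is a minimal end-to-end sketch. It assumes Beautiful Soup 4 (the `bs4` package) and Python 3, which the answer does not specify; the function name `visible_text` and the sample markup are made up for illustration. It also removes `style` tags, which go beyond what the answer mentions, and uses `decompose()` rather than `extract()` because the removed tags are not needed afterwards.

```python
from bs4 import BeautifulSoup

def visible_text(html):
    """Return the text of an HTML string with script and style content removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for find_all(); drop both tag types.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining strings with newlines, stripping each piece.
    return soup.get_text(separator="\n", strip=True)

# Hypothetical sample page for illustration.
sample = (
    "<html><head><title>Demo</title>"
    "<script>var hidden = 1;</script></head>"
    "<body><p>Hello</p><p>World</p></body></html>"
)
print(visible_text(sample))
```

Note that the `title` text is still included, since it is ordinary text as far as the parser is concerned; filter it out the same way if it is unwanted.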

Answered 2025-04-16 by Python大师


With BeautifulSoup, something like the following code:

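def _extract_text(t):
    # Empty strings contribute nothing.
    if not t:
        return ""
    # Text node: replace newlines and collapse runs of spaces.
    if isinstance(t, str):
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    # <br> becomes a line break; <script> content is dropped.
    if t.name.lower() in ("br", "script"):
        return "\n"
    # Any other tag: concatenate the text of its children.
    return "".join(_extract_text(c) for c in t)

def extract_text(t):
    # Trim whitespace around each output line.
    return "\n".join(x.strip() for x in _extract_text(t).split("\n"))

print(extract_text(htmlDom))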
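The same idea (walk the document, turn `br` into a newline, skip `script`) can also be sketched without any third-party library. The version below swaps in the standard library's `html.parser.HTMLParser` instead of BeautifulSoup; the class name `TextExtractor`, the function `html_to_text`, and the extra handling of `style` are assumptions added for illustration, not part of the answer.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text outside <script>/<style>, turning <br> into a newline."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace inside each text node.
            self.parts.append(" ".join(data.split()))

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Trim whitespace around each output line.
    return "\n".join(line.strip() for line in text.split("\n"))

print(html_to_text("<p>Hello<br>big   world</p><script>alert(1)</script>"))
```

Like the helper above, this drops whitespace at the edges of each text node, so words separated only by a tag boundary can run together; joining the parts with a space instead would be one way to trade that for occasional extra spaces.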

Answered 2025-04-16 by Python大师


If you want to avoid extracting the contents of any script tag when using BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

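will do that: it returns the direct children of the root node that are not script tags (a separate htmlDom.findAll(recursive=False, text=True) returns the strings that are direct children of the root). You need to do this recursively; for example, with a generator: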

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        # A tag has a non-None name; getattr also covers string nodes without one.
        if getattr(s, 'name', None) is not None:
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s):
                yield x
        else:  # it's a string!
            yield s

I use childGenerator (instead of findAll) so that I get all of the children in document order and can do my own filtering.
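For readers on Beautiful Soup 4, the generator above might look roughly like the sketch below. It assumes the `bs4` package and Python 3, uses the `.children` iterator in place of `childGenerator`, and tells tags from strings with `isinstance`; the name `get_strings` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def get_strings(root):
    """Yield text nodes under root in document order, skipping <script> subtrees."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name == "script":
                continue  # skip the whole script subtree
            yield from get_strings(child)
        elif isinstance(child, NavigableString):
            yield str(child)

# Hypothetical sample markup for illustration.
soup = BeautifulSoup(
    "<div>one<script>var x = 2;</script><p>two</p>three</div>", "html.parser"
)
print(list(get_strings(soup)))
```

HTML comments are also string nodes in Beautiful Soup, so a page containing them would need an extra check if they should be excluded.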

Answered 2025-04-16 by Python大师

