How can I scrape only the visible text of a web page with BeautifulSoup?

156 votes

11 answers

182105 views

Asked 2025-04-15 17:13

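Basically, I want to use BeautifulSoup to grab strictly the visible text on a web page. For instance, this webpage is my test case. I mainly want the body text (the article) and maybe a few tab names. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments that I don't want. I can't figure out which arguments findAll() needs in order to return only the visible text on a page.

So, how do I find all the visible text while excluding scripts, comments, CSS, and so on?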


11 Answers


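from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urlopen(url).read()
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")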

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
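
To see what the whitespace cleanup at the end does on its own, here is a small sketch that runs the same three steps on a hard-coded string. The `clean_whitespace` name and the sample text are made up for illustration; they are not part of the answer above.

```python
def clean_whitespace(text):
    # strip leading and trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # split on double spaces so items that shared a line each get their own line
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # keep only the non-empty pieces
    return "\n".join(chunk for chunk in chunks if chunk)


# hypothetical sample resembling the raw output of get_text()
sample = "  Top stories  \n\n\n   Weather  Sports   \n  Storm hits the East Coast \n"
cleaned = clean_whitespace(sample)
print(cleaned)
```

Single spaces inside a phrase are left alone, so "Top stories" stays on one line while "Weather" and "Sports" are separated.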

Answered 2025-04-15 by Python大师


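The accepted answer from @jbochi didn't work for me: the call to str() raises an error because it can't handle non-ASCII characters in the BeautifulSoup element. Here is a more concise way to filter the visible text out of the example page.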

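from bs4 import BeautifulSoup

html = open('21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
# detach every element whose text is never rendered on the page
for s in soup(['style', 'script', '[document]', 'head', 'title']):
    s.extract()
visible_text = soup.get_text()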

Answered 2025-04-15 by Python大师



Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node, including comments and script bodies
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
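
For anyone who wants to check the filtering without hitting the network, below is a minimal offline sketch of the same idea, run against a tiny inline document. The `SAMPLE_HTML` string and the `visible_text` helper are hypothetical, written just for this illustration; the one behavioural addition is that whitespace-only text nodes are dropped so the joined result has no stray spaces.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# made-up page containing a title, a stylesheet, a comment and a script
SAMPLE_HTML = """<html><head><title>Storm report</title>
<style>p { color: red; }</style></head>
<body><!-- hidden note --><h1>Storm</h1>
<script>var x = 1;</script><p>Snow fell overnight.</p></body></html>"""

HIDDEN_PARENTS = ("style", "script", "head", "title", "meta", "[document]")


def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for node in soup.find_all(string=True):
        # HTML comments are text nodes too, but are never rendered
        if isinstance(node, Comment):
            continue
        # skip text that sits directly inside a non-rendered element
        if node.parent.name in HIDDEN_PARENTS:
            continue
        # assumption for this sketch: ignore whitespace-only nodes
        if node.strip():
            kept.append(node.strip())
    return " ".join(kept)


result = visible_text(SAMPLE_HTML)
print(result)
```

Note that this only looks at the immediate parent of each text node, the same as the answer's `tag_visible`, so it does not catch elements hidden through CSS such as `display: none`.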

Answered 2025-04-15 by Python大师

