How do I scrape only the visible text of a web page with BeautifulSoup?
11 Answers
37
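When programming, we sometimes run into problems such as code that misbehaves or raises errors. The cause may be a bug in our code or a concept we have misunderstood.
When looking for solutions online, sites like StackOverflow are very useful: many developers share their experience and solutions there, helping others solve similar problems.
If you get stuck while learning to program, these forums are worth a look for the answer you need. Keep practicing as well, since hands-on work is the only way to really understand programming.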
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# remove all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out of the tree

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
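The whitespace cleanup at the end of this answer does not depend on BeautifulSoup, so it can be tried in isolation. Below is a minimal sketch, assuming the split is meant to happen on two consecutive spaces; the helper name `normalize_text` and the sample string are made up for illustration.

```python
def normalize_text(text):
    """Collapse get_text()-style output into one non-empty chunk per line."""
    # strip leading and trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # split on double spaces so side-by-side headlines land on separate lines
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop empty chunks
    return "\n".join(chunk for chunk in chunks if chunk)


# hypothetical sample resembling raw get_text() output
sample = "  Top stories  \n\n\n World news  Sports today \n   \n Weather "
cleaned = normalize_text(sample)
print(cleaned)
```

Splitting on a single space instead would put every word on its own line, which is probably not what the "multi-headlines" comment intends.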
40
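The accepted answer from @jbochi does not work for me. The str() call raises an error because it cannot handle the non-ASCII characters in the BeautifulSoup element. Here is a more concise way to filter the example web page down to its visible text.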
from bs4 import BeautifulSoup

html = open('21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
# drop elements whose content is never rendered
for s in soup(['style', 'script', '[document]', 'head', 'title']):
    s.extract()
visible_text = soup.get_text()
304
Try this:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    # text inside these parents is never rendered on the page
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are not visible either
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)


html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
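To try the same filtering idea without a network request, here is a minimal sketch that runs on an inline HTML string. The names `SAMPLE_HTML`, `HIDDEN_PARENTS`, and `visible_text` are hypothetical, and unlike the answer above this variant also skips whitespace-only strings so the output has no runs of spaces.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# made-up sample page with hidden and visible content
SAMPLE_HTML = """
<html>
  <head><title>Storm report</title><style>p { color: red; }</style></head>
  <body>
    <!-- editorial note -->
    <p>Heavy snow fell overnight.</p>
    <script>console.log("tracking");</script>
    <p>Roads reopened by noon.</p>
  </body>
</html>
"""

# parents whose text is never rendered
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for node in soup.find_all(string=True):
        # skip comments and text that lives under a non-rendered parent
        if isinstance(node, Comment) or node.parent.name in HIDDEN_PARENTS:
            continue
        stripped = node.strip()
        if stripped:  # assumption: whitespace-only strings are dropped
            parts.append(stripped)
    return " ".join(parts)


result = visible_text(SAMPLE_HTML)
print(result)
```

Note that this only filters by tag type; text hidden through CSS (for example `display: none`) would still be included.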