BeautifulSoup4 get_text still has JavaScript
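I am trying to strip all the HTML and JavaScript with bs4, but the JavaScript is not removed; I still see it in the text. Is there a way around this?
I tried nltk's clean_html and clean_url, but they are going to be removed. Is there a way to get the same result with soup's get_text?
I also looked at some other pages:
BeautifulSoup get_text does not strip all tags and JavaScript
At the moment I am using nltk's deprecated functions.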
EDIT
Here's an example:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
For CNN, I still see the following in the output:
$j(function() {
    "use strict";
    if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
        var pushLib = window.safaripushLib,
            current = pushLib.currentPermissions();
        if (current === "default") {
            pushLib.checkPermissions("helloClient", function() {});
        }
    }
});

/*globals MainLocalObj*/
$j(window).load(function () {
    'use strict';
    MainLocalObj.init();
});
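How can I remove the JS?
The only other option I found is:
https://github.com/aaronsw/html2text
The problem with html2text is that it is sometimes really slow and causes noticeable lag, which is the one thing nltk always handled very well.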
2 Answers
10
To prevent encoding errors at the end...
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# remove all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # detach the element from the tree

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (separated by two spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

# write UTF-8 bytes directly so the console's encoding cannot raise an error
sys.stdout.buffer.write(text.encode('utf-8') + b'\n')
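The encoding concern can be reproduced without fetching any page. The snippet below is only a small illustration; the sample string is made up and stands in for scraped text that contains non-ASCII characters. It shows that encoding to UTF-8 works, that a narrower codec such as ASCII raises, and that errors="replace" is a lossy fallback.

```python
# Hypothetical scraped text with non-ASCII characters (an accented letter and an en dash).
text = "Caf\u00e9 \u2013 top stories"

# UTF-8 can represent every character in this string, so this does not raise.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent the accented letter or the dash, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# errors="replace" avoids the exception but substitutes "?" for each
# character the codec cannot represent.
ascii_lossy = text.encode("ascii", errors="replace")

print(utf8_bytes)
print(ascii_ok)
print(ascii_lossy)
```

Whether printing plain text actually fails depends on the encoding of the console or file being written to, which is why encoding explicitly to UTF-8 sidesteps the problem.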
96
This is based in part on the question "Can I remove script tags with BeautifulSoup?", which has more detail.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# remove all script and style elements
for script in soup(["script", "style"]):
    script.decompose()  # remove the element and destroy it

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (separated by two spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
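For trying this out offline, here is a minimal sketch of the same steps wrapped in a function and run on an inline HTML string instead of a live page. The helper name html_to_text and the sample markup are made up for illustration, and "html.parser" is just one parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; any HTML string would work here.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <style>body { color: red; }</style>
    <script>var tracker = "should not appear";</script>
  </head>
  <body>
    <h1>Top story</h1>
    <p>First headline  Second headline</p>
    <script>console.log("inline script");</script>
  </body>
</html>
"""


def html_to_text(html):
    """Return the text of an HTML string without script or style content.

    Hypothetical helper name, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove <script> and <style> elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()

    text = soup.get_text()
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split on two consecutive spaces so run-together headlines get their own line.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with newlines.
    return "\n".join(chunk for chunk in chunks if chunk)


if __name__ == "__main__":
    print(html_to_text(SAMPLE_HTML))
```

extract() would work in the loop as well; it removes the element and returns it, while decompose() removes it and destroys it. As far as I can tell from the Beautiful Soup documentation, version 4.9.0 and later generally no longer treat the contents of script and style tags as text when lxml or html.parser is used, so the explicit removal matters mostly for older versions or other parsers.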