BeautifulSoup4 get_text still returns JavaScript

52 votes
2 answers
24670 views
Asked 2025-04-18 00:56

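I am trying to strip all the HTML and JavaScript with bs4, but the JavaScript is not removed and still shows up in the text. Is there a way around this?

I tried nltk, which works fine, but clean_html and clean_url will be removed in the future. Is there a way to get the same result with soup's get_text?

I have also looked at these other pages:

BeautifulSoup get_text does not strip all tags and JavaScript

For now I am relying on these deprecated nltk functions.

EDIT

Here is an example: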

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()

For CNN, I still see the following in the output:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

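How can I remove this JS?

The only other option I have found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it is sometimes very slow and causes noticeable lag, whereas nltk was always fast.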

2 Answers

10

To avoid encoding errors when printing the result, encode the text as UTF-8 at the end:

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"  # same page as in the question
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))
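
The whitespace cleanup at the end of the snippet above does not depend on BeautifulSoup, so it can be tried on its own. Here is a minimal sketch of that step as a helper; the name `normalize_whitespace` and the sample string are made up for illustration and are not part of the answer.

```python
def normalize_whitespace(text):
    """Strip each line, split on double spaces, and drop empty pieces."""
    # Remove leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split lines that contain runs of two spaces into separate chunks.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only the non-empty chunks, one per output line.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical input that mimics the ragged output of get_text().
raw = "  Top stories  \n\n   World    Politics  \n\t\nWeather\n"
print(normalize_whitespace(raw))
```

Because the split is on two spaces, single spaces inside a phrase such as "Top stories" are preserved, while wider gaps are treated as separators.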

96

This is partly based on the question "Can I remove script tags with BeautifulSoup?"; see that question for more details.

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
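
The snippets in this thread call `urllib.urlopen`, which only exists in Python 2. Below is a rough Python 3 sketch of the same idea that works on an HTML string, so it can be run without a network connection. The function name `visible_text` and the sample markup are assumptions made for illustration, and passing "html.parser" is just one possible parser choice.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML document with <script> and <style> removed."""
    # Name the parser explicitly so bs4 does not have to pick one itself.
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Strip each line and drop the ones that end up empty.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical sample page; any HTML string works here.
sample = """
<html><head><title>Demo</title>
<style>body { color: red; }</style>
<script>var pushLib = window.safaripushLib;</script>
</head><body><p>Hello, world.</p></body></html>
"""
print(visible_text(sample))
```

The two answers differ in one call: `extract()` removes the tag from the tree and returns it, while `decompose()` removes it and destroys it. Either one keeps the script contents out of `get_text()`. To fetch a live page in Python 3, `urllib.request.urlopen(url).read()` can supply the `html` argument.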
