BeautifulSoup4 get_text still has JavaScript

52 votes

2 answers

24670 views

数据工程师

Asked 2025-04-18 00:56

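I'm trying to strip all the HTML and JavaScript with bs4, but the JavaScript is not removed; I can still see it in the text. Is there a way around this?

I tried nltk, which works well, but clean_html and clean_url will be removed in the future. Is there a way to get the same result with soup's get_text?

I also looked at this other page:

BeautifulSoup get_text does not strip all tags and JavaScript

For now I'm relying on nltk's deprecated functions.

EDIT

Here's an example: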

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

For CNN, the output still contains the following:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

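How can I remove this JS?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text, though, is that it is sometimes really slow and causes noticeable lag, which is one thing nltk never had trouble with.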

javascript text-processing web-scraping html-parsing data-cleaning beautifulsoup nltk tag-removal

2 Answers

To prevent encoding errors at the end...

import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"  # the page from the question
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# remove all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # detach it from the tree

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

# write UTF-8 bytes directly so the console encoding cannot raise an error
sys.stdout.buffer.write(text.encode('utf-8') + b'\n')
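
This answer removes the elements with extract(); Beautiful Soup also provides decompose(). The sketch below is only an illustration of how the two differ, run on a made-up HTML snippet (the string and the variable names are assumptions, not part of the answer). As far as I understand the library, extract() hands the removed element back to you, while decompose() destroys it and returns nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page.
html = "<p>Keep me</p><script>var x = 1;</script><style>p { color: red; }</style>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the element and returns it, so it can still be inspected.
removed_script = soup.script.extract()
print(removed_script.name)

# decompose() detaches the element and destroys it; it returns None.
result = soup.style.decompose()
print(result)

# Either way, only the paragraph text is left in the tree.
remaining_text = soup.get_text()
print(remaining_text)
```

If the removed elements are never needed again, either call should do for stripping JavaScript from the text.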

Answered 2025-04-18 by Python大师

This is based in part on the question "Can I remove script tags with BeautifulSoup?"

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# remove all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # remove it from the tree and destroy it

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
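
To try the steps above without hitting the network, here is a rough sketch that wraps them in a function and runs it on a small made-up page. SAMPLE_HTML and the name visible_text are hypothetical, chosen only for illustration. It also shows what the split on two spaces does: a heading containing a double space ends up on two lines. As far as I know, the Beautiful Soup documentation says that from version 4.9.0, with lxml or html.parser, get_text() generally no longer treats the contents of script and style tags as text, so the explicit removal probably matters most on older versions or with html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; stands in for the HTML downloaded from a real site.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <style>body { color: red; }</style>
    <script>var tracker = "should not appear";</script>
  </head>
  <body>
    <h1>Headline one  Headline two</h1>
    <p>First paragraph.</p>
    <script>console.log("inline script");</script>
  </body>
</html>
"""

def visible_text(html):
    """Return the page text with script/style content and blank lines removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop every <script> and <style> element before asking for the text.
    for element in soup(["script", "style"]):
        element.decompose()

    text = soup.get_text()
    # Strip each line, split on double spaces, and discard empty chunks.
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)

cleaned = visible_text(SAMPLE_HTML)
print(cleaned)
```

Passing the bytes returned by urlopen(url).read() instead of SAMPLE_HTML should work the same way.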

Answered 2025-04-18 by Python大师
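
Since the question mentions that html2text can be slow, it may be worth noting that the standard library's html.parser can do a rough version of the same job with no extra dependency. This is not what either answer uses; it is a swapped-in technique, sketched under the assumption that skipping script and style content is all that is needed. TextExtractor, strip_scripts and SAMPLE are hypothetical names, and I have not measured how its speed compares with the other options.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._parts = []       # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Strip each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)

def strip_scripts(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

# Hypothetical input; note the markup inside the script is not treated as tags.
SAMPLE = """<p>Hello</p>
<script>var x = '<b>not text</b>';</script>
<style>p { color: red; }</style>
<p>World</p>"""

plain = strip_scripts(SAMPLE)
print(plain)
```

Unlike Beautiful Soup, this sketch makes no attempt to repair badly broken markup, so for messy real-world pages the approach in the answers is likely the safer choice.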
