Removing boilerplate content from HTML pages

Posted 2024-05-29 10:32:09

I want to use jusText (https://github.com/miso-belica/jusText) to get clean content out of HTML pages. It basically works like this:

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    # Keep only the paragraphs jusText did not classify as boilerplate
    if not paragraph.is_boilerplate:
        print(paragraph.text)

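I have already downloaded the pages I want to parse with this tool (some of them are no longer available online) and extracted their HTML content. Since jusText seems to work only on the output of a request (a response-type object), I would like to know whether there is any custom way to set the content of a response object so that it contains the HTML text I want to parse.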

1 answer

User

#1 · Posted 2024-05-29 10:32:09

response.content is just a byte string (a str in Python 2, as in the session below; bytes in Python 3):

>>> from requests import get
>>> r = get("http://www.google.com/")
>>> type(r.content)
<type 'str'>

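So there is no need to build a response object at all: call justext.justext directly on the HTML you already have, passing your own string in place of response.content.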
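To illustrate the answer, here is a minimal sketch of running jusText on a page that was saved to disk instead of fetched with requests. The helper names read_saved_html and extract_clean_text, and the file name saved_page.html, are made up for this example; only justext.justext, justext.get_stoplist, is_boilerplate and text come from the question's own code.

```python
from pathlib import Path


def read_saved_html(path):
    """Return the raw bytes of an HTML file saved on disk."""
    return Path(path).read_bytes()


def extract_clean_text(html, language="English"):
    """Return the text of every paragraph jusText does not flag as boilerplate."""
    # Third-party package (pip install justext); imported here so that
    # read_saved_html can be used on its own.
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]


# Example usage ("saved_page.html" is a placeholder file name):
# for text in extract_clean_text(read_saved_html("saved_page.html")):
#     print(text)
```

Reading the file as bytes mirrors what response.content provides, so the call to justext.justext is the same as in the question; an already decoded string should work as well.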
