Extract the title, image and description from any URL.

Python project description for summary-extraction


Current version: v0.2. See changes.txt for details.

Simple usage

Using the summary package:

>>> import summary
>>> s = summary.Summary('https://github.com/svven/summary')
>>> s.extract()
>>> s.title
u'svven/summary'
>>> s.image
u'https://avatars0.githubusercontent.com/u/7524085?s=400'
>>> s.description
u'summary - Summary is a complete solution to extract the title, image and description from any URL.'

Batch processing with HTML rendering

If you fork or clone the repo, you can use summarize.py like this:

>>> import summary
>>> summary.GET_ALL_DATA = True # default is False
>>> urls = [
        'http://www.wired.com/',
        'http://www.nytimes.com/',
        'http://www.technologyreview.com/lists/technologies/2014/'
    ]
>>> from summarize import summarize, render
>>> summaries, result, speed = summarize(urls)
-> http://www.wired.com/
[BadImage] RatioImageException(398, 82): http://www.wired.com/wp-content/vendor/condenast/pangea/themes/wired/assets/images/wired_logo.gif
-> http://www.nytimes.com/
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nyt.png
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nytcom.png
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-4panel-opinion.png
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375173/CRS-1572_nytpinion_EARS_L_184x90_CP2.gif
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375174/CRS-1572_nytpinion_EARS_R_184x90_ER1.gif
[BadImage] RatioImageException(379, 64): http://i1.nyt.com/images/misc/nytlogo379x64.gif
[BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/facebook.gif
[BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/twitter.gif
-> http://www.technologyreview.com/lists/technologies/2014/
Success: 3.
>>> html = render(template="news.html",
    summaries=summaries, result=result, speed=speed)
>>> with open('demo.html', 'w') as file:
...   file.write(html)
>>>

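In short

Summary requests the page from the URL, then uses extraction to parse the HTML.

Worth mentioning that it downloads the head tag first and performs specific extraction techniques, and only goes on to the rest of the page if the extracted data is incomplete. The exception is summary.GET_ALL_DATA = True, in which case it always processes the whole page.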
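The head-first idea can be illustrated in isolation. The following is a minimal sketch, not the code summary actually uses: read_head and HEAD_END are hypothetical names, and the "response" is a plain iterator of byte chunks so the example runs without a network connection.

```python
import re

# Closing head tag, case-insensitive, tolerating whitespace before ">".
HEAD_END = re.compile(rb"</head\s*>", re.IGNORECASE)


def read_head(chunks):
    """Consume byte chunks only until the closing head tag is seen.

    Returns (data, found). When found is False the whole stream was
    read and no closing head tag was present.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Search the accumulated buffer so a tag split across two
        # chunks is still detected.
        match = HEAD_END.search(buffer)
        if match:
            return buffer[:match.end()], True
    return buffer, False


# A fake streamed response: the tag is deliberately split across chunks.
stream = iter([
    b"<html><head><title>Dem",
    b"o</title></he",
    b"ad><body>",
    b"<p>never downloaded</p></body></html>",
])

head, found = read_head(stream)
print(found, head.decode("ascii"))
print(len(list(stream)), "chunk(s) left unread")
```

With requests, the chunks could presumably come from `requests.get(url, stream=True).iter_content(chunk_size=1024)`, so that the body is only fetched when the head did not yield a complete title, image and description.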

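The resulting lists of titles, images and descriptions are filtered to rule out unwanted items such as ads, tiny images (tracking images or sharing buttons), and plain white images. See the whole list of filters below.

Many thanks to Will Larson (@lethain) for adapting his extraction library to version 0.2 to accommodate summary.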

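Rendering

The purpose of the HTML rendering mechanism is just to visualize the extracted data. The included Jinja2 template (news.html) is built on Bootstrap and displays the summaries in a responsive grid layout.

You can completely disregard the rendering mechanism and just import the summary module for data extraction and filtering. You probably have your own means of rendering the data, so you only need the summary folder.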

![news.html preview](https://dl.dropboxusercontent.com/u/134594/Svven/news.png)

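This is the output with summary.GET_ALL_DATA = True.

Clicking the summary title, image or description cycles through the multiple extracted values.

<https://dl.dropboxusercontent.com/u/134594/svven/news.html>

And this one is produced much faster (see the footer) with summary.GET_ALL_DATA = False. It contains only the first valid item of each kind: title, image and description. This is the default behaviour.

<https://dl.dropboxusercontent.com/u/134594/svven/fast.html>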

Installation

Pip it for simple usage:

$ pip install summary-extraction

Clone the repo if you need rendering as well:

$ virtualenv env
$ source env/bin/activate
$ git clone https://github.com/svven/summary.git
$ pip install -r summary/requirements.txt

$ cd summary
$ python # see the usage instructions above

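Requirements

The base required packages are extraction and requests, but summary is of limited use without adblockparser and Pillow, which the image filters rely on.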

Jinja2==2.7.2 # only for rendering
Pillow==2.4.0
adblockparser==0.2
extraction==0.2
lxml==3.3.5
re2==0.2.20 # good for adblockparser
requests==2.2.1
w3lib==1.6

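Filters

Filters are callable classes that perform specific data checks.

For the moment there are only image filters. The image URL is passed as the input parameter to the first filter. The check is performed and the URL is returned if it is valid, so it is passed to the second filter, and so on. When a check fails, the filter returns None.

This pattern makes it possible to write the filtering routine like this: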

def _filter_image(self, url):
  "The param is the image URL, which is returned if it passes *all* the filters."
  return reduce(lambda f, g: f and g(f),
    [
      filters.AdblockURLFilter()(url),
      filters.NoImageFilter(),
      filters.SizeImageFilter(),
      filters.MonoImageFilter(),
      filters.FormatImageFilter(),
    ])

images = filter(None, map(self._filter_image, image_urls))
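To see the chaining pattern on its own, here is a self-contained sketch. Every class in it (Image, BlocklistFilter, LoadFilter, MinSizeFilter) is a made-up stand-in rather than one of summary's filters, and "downloading" is faked with a dictionary of known sizes. What it shares with the routine above is the shape: the first step is already applied to the URL, and each later step receives the previous result unless that result is None.

```python
from functools import reduce


class Image(object):
    """Hypothetical stand-in for the object a loading filter returns."""

    def __init__(self, url, width, height):
        self.url, self.width, self.height = url, width, height


class BlocklistFilter(object):
    """Takes a URL; returns it unchanged, or None if it looks like an ad."""

    def __call__(self, url):
        return None if "/ads/" in url.lower() else url


class LoadFilter(object):
    """Takes a URL; returns an Image, or None if it cannot be 'loaded'."""

    def __init__(self, known_sizes):
        self.known_sizes = known_sizes  # fake downloads: url -> (w, h)

    def __call__(self, url):
        size = self.known_sizes.get(url)
        return Image(url, *size) if size else None


class MinSizeFilter(object):
    """Takes an Image; returns it, or None if either side is too small."""

    def __init__(self, min_side=50):
        self.min_side = min_side

    def __call__(self, image):
        ok = min(image.width, image.height) >= self.min_side
        return image if ok else None


def filter_image(url, sizes):
    # The first element is a value (URL or None); the rest are callables.
    steps = [BlocklistFilter()(url), LoadFilter(sizes), MinSizeFilter()]
    # "value and step(value)" short-circuits once any step returned None.
    return reduce(lambda value, step: value and step(value), steps)


sizes = {
    "http://example.com/photo.png": (400, 300),
    "http://example.com/icon.gif": (16, 16),
    "http://example.com/ADS/banner.png": (300, 250),
}
kept = [img.url for img in (filter_image(u, sizes) for u in sizes) if img]
print(kept)
```

Note that on Python 3 reduce has to be imported from functools, and filter and map return iterators rather than lists.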
  • AdblockURLFilter

    Uses adblockparser and returns None if it blocks the URL.

    Hats off to Mikhail Korobov (@kmike) for the awesome work. It gives a lot of value to this mashup repo.

  • NoImageFilter

    Retrieves the actual image file, and returns None if it fails.

    Otherwise it returns an instance of an image class containing the URL, together with the size and format of the actual image. Basically it hydrates this instance, which is passed to the following filters. A method override on that class returns just the URL, which is what makes the compact filtering routine above possible.

    Worth mentioning again that it only gets first few chunks of the image file until the PIL parser gets the size and format of the image.

  • SizeImageFilter

    Checks that the image instance has a proper size.

    Based on defined limits, this can raise one of three exceptions, among them TinyImageException and RatioImageException (both appear in the batch log above). If any of these happens it returns None.

  • MonoImageFilter

    Checks whether the image is plain white and returns None if it is.

    This filter retrieves the whole image file so it has an extra regex check before. E.g.: rules out these URLs:

  • FormatImageFilter

    Rules out animated gif images for the moment. This can be extended to exclude other image formats based on file contents.
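The NoImageFilter description mentions reading only the first few chunks of an image until its size and format are known. The sketch below shows that idea under a deliberately narrower assumption: instead of the PIL parser it decodes the fixed-layout PNG header by hand, so it runs with the standard library only and handles PNG only. The name png_size_from_chunks is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size_from_chunks(chunks):
    """Return (width, height) once the PNG header is complete, else None.

    A PNG starts with an 8-byte signature followed by the IHDR chunk:
    4-byte length, 4-byte type, then width and height as big-endian
    unsigned 32-bit integers. The first 24 bytes are therefore enough.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= 8 and not buffer.startswith(PNG_SIGNATURE):
            return None  # not a PNG, stop reading immediately
        if len(buffer) >= 24:
            if buffer[12:16] != b"IHDR":
                return None  # malformed header
            return struct.unpack(">II", buffer[16:24])
    return None  # stream ended before the header was complete


# Build a fake PNG "download": a valid header followed by filler bytes.
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR"
header += struct.pack(">II", 398, 82)
payload = header + b"\x00" * 76  # 100 bytes in total

# Serve it in 10-byte chunks; only the first three should be consumed.
stream = iter([payload[i:i + 10] for i in range(0, len(payload), 10)])
size = png_size_from_chunks(stream)
print(size, "with", len(list(stream)), "chunks left unread")
```

A size filter could then compare the returned dimensions against its limits without the rest of the file ever being transferred.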

That is it for now. Your contributions are very welcome.

Comments and suggestions are welcome. Cheers, @ducu
