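summary-extraction Python package - PyPI

Extract the title, image and description from any URL.

Detailed description of the summary-extraction Python project

Current version: v0.2. See changes.txt for details.

Simple usage

Using the summary package: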

>>> import summary
>>> s = summary.Summary('https://github.com/svven/summary')
>>> s.extract()
>>> s.title
u'svven/summary'
>>> s.image
https://avatars0.githubusercontent.com/u/7524085?s=400
>>> s.description
u'summary - Summary is a complete solution to extract the title, image and description from any URL.'

Batch processing with HTML rendering

If you fork or clone the repo, you can use summarize.py like this:

>>> import summary
>>> summary.GET_ALL_DATA = True # default is False
>>> urls = [
        'http://www.wired.com/',
        'http://www.nytimes.com/',
        'http://www.technologyreview.com/lists/technologies/2014/'
    ]
>>> from summarize import summarize, render
>>> summaries, result, speed = summarize(urls)
-> http://www.wired.com/
[BadImage] RatioImageException(398, 82): http://www.wired.com/wp-content/vendor/condenast/pangea/themes/wired/assets/images/wired_logo.gif
-> http://www.nytimes.com/
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nyt.png
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nytcom.png
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-4panel-opinion.png
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375173/CRS-1572_nytpinion_EARS_L_184x90_CP2.gif
[BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375174/CRS-1572_nytpinion_EARS_R_184x90_ER1.gif
[BadImage] RatioImageException(379, 64): http://i1.nyt.com/images/misc/nytlogo379x64.gif
[BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/facebook.gif
[BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/twitter.gif
-> http://www.technologyreview.com/lists/technologies/2014/
Success: 3.
>>> html = render(template="news.html",
    summaries=summaries, result=result, speed=speed)
>>> with open('demo.html', 'w') as file:
...   file.write(html)
>>>

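In a nutshell

summary requests the page from the URL, then uses extraction to parse the HTML.

Worth mentioning that it downloads the head tag first and performs specific extraction techniques on it. It goes on to the rest of the page only if the extracted data is incomplete, unless summary.GET_ALL_DATA = True.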
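The head-first strategy can be sketched roughly as follows. This is a minimal illustration and not the package's code: read_head, first_match and staged_extract are hypothetical names, the regex-based title lookup stands in for the real extraction techniques, and only the GET_ALL_DATA flag name comes from the text above.

```python
import re

GET_ALL_DATA = False  # named after summary.GET_ALL_DATA; everything else is hypothetical


def read_head(stream):
    """Consume chunks from the iterator until a closing head tag is seen."""
    buffer = ""
    for chunk in stream:
        buffer += chunk
        if "</head>" in buffer.lower():
            break
    return buffer


def first_match(pattern, html):
    """Return the first captured group for pattern, or None."""
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def staged_extract(chunks, get_all_data=GET_ALL_DATA):
    """Parse the head first; read the rest of the stream only when needed."""
    stream = iter(chunks)
    html = read_head(stream)
    title = first_match(r"<title[^>]*>(.*?)</title>", html)
    # Keep reading only if the head was not enough, or if all data is wanted.
    read_body = get_all_data or title is None
    if read_body:
        html += "".join(stream)
        title = title or first_match(r"<h1[^>]*>(.*?)</h1>", html)
    return {"title": title, "read_body": read_body}
```

Because the chunks are consumed lazily from an iterator, a page whose head already carries a title never has its body read in this sketch.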

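The resulting lists of titles, images and descriptions are filtered to rule out unwanted items such as ads, tiny images (tracking images or sharing buttons), and plain white images. See the whole list of filters below.

Many thanks to Will Larson (@lethain) for adapting his extraction library to version 0.2 to accommodate summary.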

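Rendering

The purpose of the HTML rendering mechanism is just to visualize the extracted data. The included Jinja2 template (news.html) is built on top of Bootstrap and displays the summaries in a responsive grid layout.

You can completely disregard the rendering mechanism and just import the summary module for data extraction and filtering. You probably have your own means of rendering the data, so you only need the summary folder.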

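![news.html preview](https://dl.dropboxusercontent.com/u/134594/Svven/news.png)

This is the output with summary.GET_ALL_DATA = True.

Click a summary's title, image or description to cycle through the multiple extracted values.

<https://dl.dropboxusercontent.com/u/134594/svven/news.html>

And this one is produced much faster (see the footer) with summary.GET_ALL_DATA = False. It contains only the first valid item of each type: title, image and description. This is the default behaviour.

<https://dl.dropboxusercontent.com/u/134594/svven/fast.html>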

Installation

Pip it for simple usage:

$ pip install summary-extraction

Clone the repo if you need rendering as well:

$ virtualenv env
$ source env/bin/activate
$ git clone https://github.com/svven/summary.git
$ pip install -r summary/requirements.txt

$ cd summary
$ python # see the usage instructions above

Requirements

The base required packages are extraction and requests, but the image filters also rely on adblockparser and Pillow:

Jinja2==2.7.2 # only for rendering
Pillow==2.4.0
adblockparser==0.2
extraction==0.2
lxml==3.3.5
re2==0.2.20 # good for adblockparser
requests==2.2.1
w3lib==1.6

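Filters

Filters are callable classes that perform specific data checks.

For the moment there are only image filters. The image URL is passed as the input parameter to the first filter. The check is performed and the URL is returned if it is valid, so it is passed on to the second filter, and so on. When a check fails, the filter returns None.

This pattern makes it possible to write the filtering routine like this: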

def _filter_image(self, url):
  "The param is the image URL, which is returned if it passes *all* the filters."
  return reduce(lambda f, g: f and g(f),
    [
      filters.AdblockURLFilter()(url),
      filters.NoImageFilter(),
      filters.SizeImageFilter(),
      filters.MonoImageFilter(),
      filters.FormatImageFilter(),
    ])

images = filter(None, map(self._filter_image, image_urls))
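For readers on Python 3, where reduce lives in functools and filter returns an iterator, the same chaining idea can be sketched in a self-contained form. The two filter classes below are hypothetical stand-ins invented for illustration, not the library's filters, and unlike the routine above the sketch passes the URL to reduce as an initial value instead of calling the first filter by hand.

```python
from functools import reduce


class BlocklistURLFilter:
    """Hypothetical stand-in for an ad filter: rejects URLs containing a blocked fragment."""

    def __init__(self, blocked=("/adx/", "/ads/")):
        self.blocked = blocked

    def __call__(self, url):
        return None if any(part in url for part in self.blocked) else url


class ExtensionFilter:
    """Hypothetical filter: keeps only URLs ending in an allowed extension."""

    def __init__(self, allowed=(".png", ".jpg", ".jpeg")):
        self.allowed = allowed

    def __call__(self, url):
        return url if url.lower().endswith(self.allowed) else None


def filter_image(url, filters):
    """Return the URL if every filter accepts it, otherwise None."""
    # Once a filter returns None, `value and ...` short-circuits for the rest.
    return reduce(lambda value, check: value and check(value), filters, url)


def filter_images(urls):
    """Keep only the URLs that pass the whole chain."""
    filters = [BlocklistURLFilter(), ExtensionFilter()]
    return [kept for kept in (filter_image(url, filters) for url in urls) if kept]
```

Since each filter returns either its input or None, adding a new check only means appending another callable to the list.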

AdblockURLFilter
Uses adblockparser and returns None if its rules say the URL should be blocked.
Hats off to Mikhail Korobov (@kmike) for the awesome work. It gives a lot of value to this mashup repo.
NoImageFilter
Retrieves the actual image file, and returns None if that fails.
Otherwise it returns an instance of an image class containing the URL, together with the size and format of the actual image. Basically it hydrates this instance, which is then passed to the following filters. An override of the instance's string representation returns just the URL, which is what allows the compact filtering routine shown above.
Worth mentioning again that it only gets the first few chunks of the image file, until the PIL parser has the size and format of the image.
SizeImageFilter
Checks that the image instance has a proper size.
This can raise several exceptions based on defined limits, among them TinyImageException and RatioImageException (both appear in the batch log above). If any of these is raised, the filter returns None.
MonoImageFilter
Checks whether the image is plain white and returns None if it is.
Because this filter retrieves the whole image file, a regex check on the URL runs first. For example, that check rules out these URLs:
- http://wordpress.com/i/blank.jpg?m=1383295312g
- http://images.inc.com/leftnavmenu/inc-logo-white.png
FormatImageFilter
Rules out animated gif images for the moment. This can be extended to exclude other image formats based on file contents.
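
To illustrate why the first few bytes of an image are enough for a size check, here is a standard-library sketch. It is not how the package works: the package relies on the PIL parser, whereas this example swaps in a manual read of the PNG header. The limits MIN_SIDE and MAX_RATIO and the exception classes TooSmall and BadRatio are assumptions made up for the example.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_SIDE = 32    # assumed limit, not taken from the library
MAX_RATIO = 3.0  # assumed limit, not taken from the library


class TooSmall(Exception):
    """Hypothetical exception for images below MIN_SIDE."""


class BadRatio(Exception):
    """Hypothetical exception for images that are too elongated."""


def png_size(first_bytes):
    """Read (width, height) from the first 24 bytes of a PNG file."""
    if len(first_bytes) < 24 or first_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG header")
    if first_bytes[12:16] != b"IHDR":
        raise ValueError("IHDR chunk missing")
    # Width and height are two big-endian unsigned 32-bit integers.
    return struct.unpack(">II", first_bytes[16:24])


def check_size(width, height):
    """Raise if the image is too small or too elongated."""
    if min(width, height) < MIN_SIDE:
        raise TooSmall((width, height))
    if max(width, height) / min(width, height) > MAX_RATIO:
        raise BadRatio((width, height))


def size_filter(first_bytes):
    """Return (width, height) when the image passes, otherwise None."""
    try:
        size = png_size(first_bytes)
        check_size(*size)
    except (ValueError, TooSmall, BadRatio):
        return None
    return size
```

As in the filters described above, any raised exception is turned into a None result, so the function can sit in a chain of checks.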

That's it for now. Your contributions are very welcome.

Comments and suggestions are welcome. Cheers, @ducu
