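Extract the title, image and description from any URL.
summary-extraction: detailed Python project description
Current version: v0.2. See changes.txt for details.
Simple usage
Using the summary package: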
    >>> import summary
    >>> s = summary.Summary('https://github.com/svven/summary')
    >>> s.extract()
    >>> s.title
    u'svven/summary'
    >>> s.image
    https://avatars0.githubusercontent.com/u/7524085?s=400
    >>> s.description
    u'summary - Summary is a complete solution to extract the title, image and description from any URL.'
Batch processing with HTML rendering
If you fork or clone the repo, you can use summarize.py like this:
    >>> import summary
    >>> summary.GET_ALL_DATA = True # default is False
    >>> urls = [
    ...     'http://www.wired.com/',
    ...     'http://www.nytimes.com/',
    ...     'http://www.technologyreview.com/lists/technologies/2014/'
    ... ]
    >>> from summarize import summarize, render
    >>> summaries, result, speed = summarize(urls)
    -> http://www.wired.com/
    [BadImage] RatioImageException(398, 82): http://www.wired.com/wp-content/vendor/condenast/pangea/themes/wired/assets/images/wired_logo.gif
    -> http://www.nytimes.com/
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nyt.png
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nytcom.png
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-4panel-opinion.png
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375173/CRS-1572_nytpinion_EARS_L_184x90_CP2.gif
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375174/CRS-1572_nytpinion_EARS_R_184x90_ER1.gif
    [BadImage] RatioImageException(379, 64): http://i1.nyt.com/images/misc/nytlogo379x64.gif
    [BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/facebook.gif
    [BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/twitter.gif
    -> http://www.technologyreview.com/lists/technologies/2014/
    Success: 3.
    >>> html = render(template="news.html", summaries=summaries, result=result, speed=speed)
    >>> with open('demo.html', 'w') as file:
    ...     file.write(html)
    >>>
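In a nutshell
summary requests the page from the URL, then uses extraction to parse the HTML.
Worth mentioning that it downloads the head tag first and performs specific extraction techniques on it. It goes on to the rest of the page only if the extracted data is incomplete, or always when summary.GET_ALL_DATA = True.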
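The head-first download can be pictured with a small standalone sketch. It is not the library's code: `read_head` and `fake_response` are hypothetical names, and the chunks are hard-coded stand-ins for a streamed HTTP body (with requests, something like `requests.get(url, stream=True).iter_content(...)` could presumably supply them).

```python
def read_head(chunks, marker=b"</head>"):
    """Consume byte chunks until the closing head tag shows up.

    Returns (html_so_far, found), where found tells whether the marker was seen.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        end = buffer.lower().find(marker)  # case-insensitive search
        if end != -1:
            # Stop early: the remaining chunks are never pulled from the stream.
            return buffer[:end + len(marker)], True
    return buffer, False


def fake_response():
    # Hypothetical stand-in for a streamed response body.
    yield b"<html><head><title>Exam"
    yield b"ple</title></HEAD><body>"
    yield b"<p>large body...</p></body></html>"


stream = fake_response()
head, found = read_head(stream)
print(head.decode("ascii"))
print(found, "chunks left unread:", len(list(stream)))
```

If the data extracted from `head` turns out to be incomplete, the caller can keep reading from the same stream and parse the whole page.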
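The resulting lists of titles, images and descriptions are filtered to exclude unwanted items such as ads, tiny images (tracking images or sharing buttons), and plain white images. See the whole list of filters below.
Many thanks to Will Larson (@lethain) for adapting his extraction library to version 0.2 to accommodate summary.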
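Rendering
The HTML rendering mechanism is there just to visualize the extracted data. The included Jinja2 template (news.html) is built on top of Bootstrap and displays the summaries in a responsive grid layout.
You can completely disregard the rendering mechanism and just import the summary module for data extraction and filtering. You probably have your own means of rendering the data, so you only need the summary folder.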
![news.html preview](https://dl.dropboxusercontent.com/u/134594/Svven/news.png)
This is the output with summary.GET_ALL_DATA = True.
Click a summary's title, image or description to cycle through the multiple extracted values.
<https://dl.dropboxusercontent.com/u/134594/svven/news.html>
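And this one was produced much faster (see the footer) with summary.GET_ALL_DATA = False. It contains only the first valid item of each kind: title, image and description. This is the default behaviour.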
<https://dl.dropboxusercontent.com/u/134594/svven/fast.html>
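The difference between the two modes can be sketched roughly as follows. This is an illustration only: `collect` and its `get_all` parameter are hypothetical names that stand in for whatever summary does internally when GET_ALL_DATA is switched.

```python
def collect(candidates, is_valid, get_all=False):
    """Return the valid candidates: all of them, or only the first one."""
    results = []
    for item in candidates:
        if is_valid(item):
            results.append(item)
            if not get_all:
                break  # fast path: stop at the first valid item
    return results


titles = ["", "svven/summary", "Summary on GitHub"]
print(collect(titles, bool))                # ['svven/summary']
print(collect(titles, bool, get_all=True))  # ['svven/summary', 'Summary on GitHub']
```

Stopping at the first valid item is what makes the default mode faster, since later candidates (and their image downloads) are never checked.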
Installation
Pip it for simple usage:
    $ pip install summary-extraction
If you need rendering, you can clone the repo as well:
    $ virtualenv env
    $ source env/bin/activate
    $ git clone https://github.com/svven/summary.git
    $ pip install -r summary/requirements.txt
    $ cd summary
    $ python
    # see the usage instructions above
Requirements
The base required packages are extraction and requests, but the image filters described below also rely on adblockparser and Pillow:
    Jinja2==2.7.2 # only for rendering
    Pillow==2.4.0
    adblockparser==0.2
    extraction==0.2
    lxml==3.3.5
    re2==0.2.20 # good for adblockparser
    requests==2.2.1
    w3lib==1.6
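Filters
Filters are callable classes that perform specific data checks.
For the moment there are only image filters. The image URL is passed as the input parameter to the first filter. The check is performed and the URL is returned if it is valid, so it is passed on to the second filter, and so on. When a check fails, the filter returns None.
This pattern lets you write the filtering routine like this: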
    def _filter_image(self, url):
        "The param is the image URL, which is returned if it passes *all* the filters."
        return reduce(lambda f, g: f and g(f), [
            filters.AdblockURLFilter()(url),
            filters.NoImageFilter(),
            filters.SizeImageFilter(),
            filters.MonoImageFilter(),
            filters.FormatImageFilter(),
        ])

    images = filter(None, map(self._filter_image, image_urls))
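To try the pattern outside the library, here is a self-contained Python 3 sketch of the same reduce chain. The three filter classes are hypothetical stand-ins invented for the example; they are not the filters shipped with summary, and the URLs are made up.

```python
from functools import reduce


class NotEmptyFilter:
    """Hypothetical filter: rejects empty URLs."""
    def __call__(self, url):
        return url if url else None


class ExtensionFilter:
    """Hypothetical filter: accepts only a few file extensions."""
    def __init__(self, allowed=(".png", ".jpg", ".jpeg")):
        self.allowed = allowed

    def __call__(self, url):
        return url if url.lower().endswith(self.allowed) else None


class NoAdsFilter:
    """Hypothetical filter: rejects URLs containing '/ads/'."""
    def __call__(self, url):
        return None if "/ads/" in url.lower() else url


def filter_image(url):
    # The first list element is already a result; the rest are filters to run.
    # `f and g(f)` short-circuits: once f is None, no later filter is called.
    return reduce(lambda f, g: f and g(f), [
        NotEmptyFilter()(url),
        ExtensionFilter(),
        NoAdsFilter(),
    ])


image_urls = [
    "http://example.com/photo.jpg",
    "http://example.com/ADS/banner.png",
    "http://example.com/page.html",
    "",
]
images = [u for u in map(filter_image, image_urls) if u]
print(images)  # ['http://example.com/photo.jpg']
```

Because each filter returns either its input or None, cheap checks can be placed first and expensive ones (those that download the image) only run for URLs that survived so far.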
AdblockURLFilter
Uses adblockparser and returns None if it should block the URL.
Hats off to Mikhail Korobov (@kmike) for the awesome work. It gives a lot of value to this mashup repo.
NoImageFilter
Retrieves the actual image file, and returns None if it fails.
Otherwise it returns an instance of an image class containing the URL, together with the size and format of the actual image. Basically it hydrates this instance, which is passed to the following filters. An override on that class returns just the URL, so we can write the beautiful filtering routine you can see above.
Worth mentioning again that it only gets the first few chunks of the image file, until the PIL parser gets the size and format of the image.
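The idea of reading just enough of the file can be shown with a standard-library-only sketch. Note the substitution: instead of the PIL parser, this example reads the PNG header by hand, so it only handles PNG and is not how the library does it. `probe_png` and `chunked` are hypothetical helper names, and the image bytes are synthetic.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def probe_png(chunks):
    """Return (width, height, bytes_read) as soon as the PNG header is complete.

    Returns None if the stream is not a PNG or ends too early.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= 8 and not buffer.startswith(PNG_SIGNATURE):
            return None
        if len(buffer) >= 24:
            if buffer[12:16] != b"IHDR":
                return None
            # Bytes 16-24 hold width and height as big-endian unsigned 32-bit ints.
            width, height = struct.unpack(">II", buffer[16:24])
            return width, height, len(buffer)
    return None


def chunked(data, size):
    # Simulates a download arriving in fixed-size pieces.
    for i in range(0, len(data), size):
        yield data[i:i + size]


header = (PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR"
          + struct.pack(">II", 398, 82) + b"\x08\x06\x00\x00\x00")
data = header + b"\x00" * 10000  # synthetic payload, never read by the probe

print(probe_png(chunked(data, 16)))  # (398, 82, 32)
```

Only 32 of the roughly 10,000 bytes are consumed before the size is known, which is the saving the paragraph above describes.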
SizeImageFilter
Checks that the image instance has a proper size.
This can raise one of three exceptions based on the defined limits, including TinyImageException and RatioImageException (both appear in the batch log above). If any of these happens it returns None.
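A rough sketch of such a size check follows. The exception names are borrowed from the batch log, but the classes are redefined here, and the limits `MIN_SIDE` and `MAX_RATIO`, the `(url, width, height)` tuple and both function names are assumptions made up for the example.

```python
class TinyImageException(Exception):
    pass


class RatioImageException(Exception):
    pass


MIN_SIDE = 50    # hypothetical limit, in pixels
MAX_RATIO = 3.0  # hypothetical limit, long side divided by short side


def check_size(width, height):
    """Raise if the image is too small or too elongated."""
    if width < MIN_SIDE or height < MIN_SIDE:
        raise TinyImageException(width, height)
    ratio = max(width, height) / float(min(width, height))
    if ratio > MAX_RATIO:
        raise RatioImageException(width, height)


def size_filter(image):
    """image is a (url, width, height) tuple; return it, or None if a check fails."""
    url, width, height = image
    try:
        check_size(width, height)
    except (TinyImageException, RatioImageException) as e:
        print("[BadImage] %s%r: %s" % (type(e).__name__, e.args, url))
        return None
    return image


print(size_filter(("http://example.com/logo.gif", 398, 82)))
print(size_filter(("http://example.com/share.gif", 16, 16)))
print(size_filter(("http://example.com/photo.jpg", 400, 300)))
```

Catching the exceptions inside the filter keeps the contract of the chain intact: a filter either returns its input or None, and the reason for rejection is only logged.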
MonoImageFilter
Checks whether the image is plain white and, if so, returns None.
This filter retrieves the whole image file, so it runs an extra regex check on the URL beforehand to rule out some URLs early.
FormatImageFilter
Rules out animated gif images for the moment. This can be extended to exclude other image formats based on file contents.
That's it for now. Your contributions are very welcome.
Comments and suggestions are welcome. Cheers, @ducu