A web bot for crawling websites and scraping images.

Detailed description of the imagebot Python project


This bot (an image scraper) crawls a given URL and downloads all the images.

Features

  • Supported platforms: Linux/Windows, Python 2.7.
  • Maintains a database of all downloaded images to avoid duplicate downloads.
  • Optionally, it can scrape only under a specific URL; e.g. scraping http://website.com/albums/new with this option will download only from the new albums.
  • Filter URLs by regex.
  • Filter images by minimum size.
  • Follows JavaScript popup links (limited support).
  • Live monitor window to display images as they are scraped.
  • Asynchronous I/O design using Scrapy and Twisted.

Usage

crawl command:

  • Crawl for images:

    imagebot crawl http://website.com
    imagebot crawl http://website.com,http://otherwebsite.com
    
  • crawl command options:

    -d, --domains

    Allow images to be downloaded from other domain(s); add multiple domains as a comma-separated list. The domain(s) in the start URL(s) are allowed by default.

    -is, --images-store

    Specify the image store location. Default: ~/Pictures/crawled/[jobname]

    -s, --min-size

    Specify the minimum size of images to be downloaded (width x height).

    -u, --stay-under

    Stay under the start URL. Only URLs that have the start URL as a prefix will be crawled. Useful, for example, for crawling an album or a subsection of a website.

    -m, --monitor

    Launch the monitor window for displaying images as they are scraped.

    -a, --user-agent

    Set the user-agent string. Default: imagebot. It is recommended to change it to identify your bot as a matter of responsible crawling.

    -r, --url-regex

    Specify a regex for URLs. Only URLs matching the regex will be crawled. It does not apply to the start URL(s).

    -dl, --depth-limit

    Specify the depth limit for crawling. Use 0 to scrape only the start URL(s).

    --no-cdns

    A list of well-known CDNs is included and enabled by default for image downloads. Use this option to disable it.

    -at, --auto-throttle

    Enable Scrapy's auto-throttle feature (see the Scrapy docs for details).

    -j, --jobname

    Specify a job name. It will be used to store image metadata in the database. By default, the domain name of the start URL is used as the job name.

    -nc, --no-cache

    Disable HTTP caching.

    -l, --log-level

    Specify the log level. Supported levels: info, silent, critical, error, debug, warning. Default: error.

    -h, --help

    Get help on crawl command options.
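The original page's inline examples for these options were lost in extraction. As an illustrative sketch only (the URL, size limit, regex, and job name below are placeholders, not taken from the original):

```shell
# Stay under the /albums path and keep only images of at least 300x300 pixels
# (the URL, size, and job name are illustrative placeholders).
imagebot crawl http://website.com/albums -u -s 300x300 -j myalbums

# Crawl only URLs matching a regex, limit depth to 2, and identify the bot
# with a custom user-agent string, as responsible crawling suggests.
imagebot crawl http://website.com -r ".*gallery.*" -dl 2 -a "mybot/1.0"
```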

clear command:

  • This command can be used for various kinds of cleanup.

  • clear command options:

    --cache

    Clear the HTTP cache.

    --db

    Remove image metadata for a job from the database.

    --duplicate-images

    Multiple copies of the same image may be downloaded due to different URLs. Use this option to delete duplicate images for a job.

    -h, --help

    Get help on clear command options.
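A similarly hedged sketch of the clear command; only --cache is shown exactly as documented, while the job-selection syntax for the other two options is an assumption (the original inline examples for --db and --duplicate-images did not survive extraction):

```shell
# Clear the HTTP cache.
imagebot clear --cache

# Delete duplicate images and stored metadata for a job. How a specific
# job is selected here is an assumption; the original examples were lost.
imagebot clear --duplicate-images
imagebot clear --db
```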

Dependencies

  1. pywin32 (http://sourceforge.net/projects/pywin32/)

    Needed on Windows.

  2. python-gi (Python GObject introspection API)

    Needed on Linux for the GTK UI (optional). If not found, Python's built-in Tkinter will be used. On Ubuntu, it is available through the system package manager.

  3. Scrapy (web crawling framework)

    It will be installed automatically by pip.

  4. Pillow (Python imaging library)

    It will be installed automatically by pip.
