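A web bot for crawling websites and scraping images.
imagebot: detailed Python project description
This bot (an image scraper) crawls a given url and downloads all the images it finds.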
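Features
- Supported platforms: Linux / Windows / Python 2.7.
- Maintains a database of all downloaded images to avoid duplicate downloads.
- Optionally, it can scrape only under a particular url; for example, scraping http://website.com/albums/new with this option will only download from the new albums.
- Filters urls by regex.
- Filters images by minimum size.
- Scrapes through javascript popup links (limited support).
- Live monitor window for displaying images as they are scraped.
- Asynchronous I/O design using scrapy and twisted.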
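The duplicate-avoidance feature listed above can be pictured with a small sketch. This is not imagebot's actual schema or code: the table name `images`, the helper names `open_store` and `should_download`, and the choice of SQLite are all assumptions made for illustration. The idea is simply that an image url is recorded per job, and a url already recorded is skipped.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small database of already-downloaded image urls.

    Hypothetical schema: one row per (job, url) pair.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "job TEXT NOT NULL, url TEXT NOT NULL, PRIMARY KEY (job, url))"
    )
    return conn


def should_download(conn, job, url):
    """Return True the first time a url is seen for a job, False afterwards."""
    seen = conn.execute(
        "SELECT 1 FROM images WHERE job = ? AND url = ?", (job, url)
    ).fetchone()
    if seen is not None:
        return False
    # Remember the url so that later runs of the same job skip it.
    conn.execute("INSERT INTO images (job, url) VALUES (?, ?)", (job, url))
    conn.commit()
    return True
```

Keying the rows by job name mirrors the description of the job name option, which says that image metadata is stored per job.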
Usage
crawl command:
Scrape images:
imagebot crawl http://website.com
imagebot crawl http://website.com,http://otherwebsite.com
crawl command options:
-d, --domains
Allow images to be downloaded from other domain(s) while scraping (give multiple domains as a comma-separated list). The domains of the start url(s) are allowed by default.
-is, --images-store
Specify image store location. Default: ~/Pictures/crawled/[jobname]
-s, --min-size
Specify minimum size of image to be downloaded (width x height).
-u, --stay-under
Stay under the start url. Only those urls that have the start url as prefix will be crawled. Useful, for example, to crawl an album or a subsection on a website.
-m, --monitor
Launch monitor window for displaying images as they are scraped.
-a, --user-agent
Set user-agent string. Default: imagebot. It is recommended to change it to identify your bot as a matter of responsible crawling.
-r, --url-regex
Specify regex for urls. Only those urls matching the regex will be crawled. It does not apply to start url(s).
-dl, --depth-limit
Specify depth limit for crawling. Use a value of 0 to scrape only the start url(s).
--no-cdns
A list of well-known CDNs is included and enabled by default for image downloads. Use this option to disable it.
-at, --auto-throttle
Enable the auto throttle feature of scrapy (details in the scrapy docs).
-j, --jobname
Specify a job name. This will be used to store image metadata in the database. By default, the domain name of the start url is used as the job name.
-nc, --no-cache
Disable http caching.
-l, --log-level
Specify log level. Supported levels: info, silent, critical, error, debug, warning. Default: error.
-h, --help
Get help on crawl command options.
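The url filtering described for the stay-under and url-regex options can be sketched as a single predicate. This is an illustration only, not imagebot's implementation: the function name `should_crawl` and its parameters are hypothetical, and it assumes the regex is applied as a search anywhere in the url (the description does not say whether imagebot uses a full match or a search).

```python
import re


def should_crawl(url, start_urls, stay_under=False, url_regex=None):
    """Decide whether a discovered url should be followed (hypothetical helper)."""
    # Start urls are always crawled; the regex does not apply to them.
    if url in start_urls:
        return True
    # Stay-under behaviour: only urls that have a start url as prefix are followed.
    if stay_under and not any(url.startswith(start) for start in start_urls):
        return False
    # Regex behaviour: only urls matching the pattern are followed.
    # Assumption: re.search semantics rather than a full match.
    if url_regex is not None and re.search(url_regex, url) is None:
        return False
    return True


start_urls = ["http://website.com/albums/new"]
print(should_crawl("http://website.com/albums/new/page2", start_urls, stay_under=True))
print(should_crawl("http://website.com/albums/old", start_urls, stay_under=True))
```

The first call prints True because the url extends the start url; the second prints False because it falls outside that prefix.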
clear command:
This command is useful for various cleanups.
clear command options:
--cache
Clear http cache.
--db
Remove image metadata for a job from the database.
--duplicate-images
Multiple copies of the same image may be downloaded when it is reachable under different urls. Use this option to delete duplicate images for a job.
-h, --help
Get help on clear command options.
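One common way to detect the duplicates that the duplicate-images option removes is to compare file contents by hash. The sketch below is an assumption about how such a cleanup could work, not a description of imagebot's own method; the function name `find_duplicates` is hypothetical, and it only reports duplicates rather than deleting them.

```python
import hashlib
import os


def find_duplicates(directory):
    """Return paths of files whose content duplicates an earlier file.

    Files are visited in sorted name order, so the first copy is kept
    and every later identical copy is reported.
    """
    first_seen = {}
    duplicates = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        if digest in first_seen:
            duplicates.append(path)
        else:
            first_seen[digest] = path
    return duplicates
```

A caller could pass each returned path to `os.remove` once the list has been reviewed. Byte-level hashing only catches exact copies; the same picture saved at a different size or quality would not be detected.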
Dependencies
pywin32 (http://sourceforge.net/projects/pywin32/)
Needed on Windows.
python-gi (Python GObject introspection API)
Needed on Linux for the gtk UI (optional). If not found, Python's built-in Tkinter will be used. On Ubuntu it is available from the distribution's package repositories.
scrapy (web crawling framework)
It will be automatically installed by pip.
pillow (Python imaging library)
It will be automatically installed by pip.
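The optional gtk dependency with a Tkinter fallback, described above, follows a familiar pattern: probe for the preferred module and fall back to the built-in one. The sketch below is a hedged illustration, not imagebot's code. The function name `pick_ui_backend` is hypothetical, and it is written for Python 3, where the built-in module is named `tkinter` (on Python 2.7 it is `Tkinter`).

```python
import importlib.util


def pick_ui_backend():
    """Return "gtk" if python-gi is importable, else "tk", else None."""
    # Candidates in order of preference: the gtk bindings first, Tk as fallback.
    for label, module in (("gtk", "gi"), ("tk", "tkinter")):
        if importlib.util.find_spec(module) is not None:
            return label
    # Neither toolkit is available, so no monitor window can be shown.
    return None


print(pick_ui_backend())
```

Probing with `find_spec` avoids importing a GUI toolkit just to check that it exists.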