A web bot for crawling websites and scraping images.

Detailed description of the imagebot Python project


This bot (an image scraper) crawls a given URL and downloads all the images.

Features

  • Supported platforms: Linux/Windows, Python 2.7.
  • Maintains a database of all downloaded images to avoid duplicate downloads.
  • Optionally, it can scrape only under a specific URL; e.g. scraping http://website.com/albums/new with this option will download only from the new albums.
  • Filter URLs by regex.
  • Filter images by minimum size.
  • Follows JavaScript popup links (limited support).
  • Live monitor window to display images as they are scraped.
  • Asynchronous I/O design using Scrapy and Twisted.

Usage

crawl command:

  • Crawl for images:

    imagebot crawl http://website.com
    imagebot crawl http://website.com,http://otherwebsite.com
    
  • crawl command options:

    -d, --domains

    Allow images to be downloaded from other domain(s); add multiple domains as a comma-separated list. The domain(s) in the start URL(s) are allowed by default.

    -is, --images-store

    Specify the image store location. Default: ~/Pictures/crawled/[jobname]

    -s, --min-size

    Specify the minimum size of images to be downloaded (width x height).

    -u, --stay-under

    Stay under the start URL. Only URLs that have the start URL as a prefix will be crawled. Useful, for example, for crawling an album or a subsection of a website.

    -m, --monitor

    Launch the monitor window for displaying images as they are scraped.

    -a, --user-agent

    Set the user-agent string. Default: imagebot. It is recommended to change it to identify your bot as a matter of responsible crawling.

    -r, --url-regex

    Specify a regex for URLs. Only URLs matching the regex will be crawled. It does not apply to the start URL(s).

    -dl, --depth-limit

    Specify the depth limit for crawling. Use 0 to scrape only the start URL(s).

    --no-cdns

    A list of well-known CDNs is included and enabled by default for image downloads. Use this option to disable it.

    -at, --auto-throttle

    Enable Scrapy's auto-throttle feature (see the Scrapy docs for details).

    -j, --jobname

    Specify a job name. It will be used to store image metadata in the database. By default, the domain name of the start URL is used as the job name.

    -nc, --no-cache

    Disable HTTP caching.

    -l, --log-level

    Specify the log level. Supported levels: info, silent, critical, error, debug, warning. Default: error.

    -h, --help

    Get help on crawl command options.
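The original page's inline examples for these options were lost in extraction. As an illustrative sketch only (the URL, size limit, regex, and job name below are placeholders, not taken from the original):

```shell
# Stay under the /albums path and keep only images of at least 300x300 pixels
# (the URL, size, and job name are illustrative placeholders).
imagebot crawl http://website.com/albums -u -s 300x300 -j myalbums

# Crawl only URLs matching a regex, limit depth to 2, and identify the bot
# with a custom user-agent string, as responsible crawling suggests.
imagebot crawl http://website.com -r ".*gallery.*" -dl 2 -a "mybot/1.0"
```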

clear command:

  • This command can be used for various kinds of cleanup.

  • clear command options:

    --cache

    Clear the HTTP cache.

    --db

    Remove image metadata for a job from the database.

    --duplicate-images

    Multiple copies of the same image may be downloaded due to different URLs. Use this option to delete duplicate images for a job.

    -h, --help

    Get help on clear command options.
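A similarly hedged sketch of the clear command; only --cache is shown exactly as documented, while the job-selection syntax for the other two options is an assumption (the original inline examples for --db and --duplicate-images did not survive extraction):

```shell
# Clear the HTTP cache.
imagebot clear --cache

# Delete duplicate images and stored metadata for a job. How a specific
# job is selected here is an assumption; the original examples were lost.
imagebot clear --duplicate-images
imagebot clear --db
```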

Dependencies

  1. pywin32 (http://sourceforge.net/projects/pywin32/)

    Needed on Windows.

  2. python-gi (Python GObject introspection API)

    Needed on Linux for the GTK UI (optional). If not found, Python's built-in Tkinter will be used. On Ubuntu, it is available through the system package manager.

  3. Scrapy (web crawling framework)

    It will be installed automatically by pip.

  4. Pillow (Python imaging library)

    It will be installed automatically by pip.
