Python scrape package - PyPI

A command-line web scraping tool

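scrape is a rule-based web crawler and information extraction tool that can process and merge new and existing documents. XML Path Language (XPath) and regular expressions are used to define rules for filtering content and for web traversal. Output can be converted to text, csv, pdf, and/or HTML formats.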
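
To make the regular-expression filtering idea concrete, here is a minimal Python sketch. It is not the tool's own implementation: it assumes that a filter keeps every line of text matching at least one of the given patterns, and the function name `filter_lines` and the sample text are made up for illustration.

```python
import re


def filter_lines(text, patterns):
    """Keep the lines of `text` that match at least one regex in `patterns`.

    Assumption: "filtering" means keeping matching lines; the real tool
    may apply its rules differently.
    """
    compiled = [re.compile(p) for p in patterns]
    return [
        line
        for line in text.splitlines()
        if any(rx.search(line) for rx in compiled)
    ]


# Hypothetical sample text, one item per line.
sample = "Python 3 released\nWeather today\nNew python scraping tool"
kept = filter_lines(sample, [r"(?i)python"])
print(kept)
```

With an empty pattern list this sketch keeps nothing, which is a design choice of the example rather than a statement about scrape itself.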

Installation

pip install scrape

or

pip install git+https://github.com/huntrar/scrape.git#egg=scrape

or

git clone https://github.com/huntrar/scrape
cd scrape
python setup.py install

You must install wkhtmltopdf to save files as pdf.
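
A quick way to confirm this prerequisite before requesting pdf output is to look for the executable on PATH. The sketch below assumes the binary is named `wkhtmltopdf`; the helper name `wkhtmltopdf_available` is hypothetical and not part of scrape.

```python
import shutil


def wkhtmltopdf_available():
    """Return True if an executable named `wkhtmltopdf` is found on PATH."""
    # Assumption: the binary uses its usual name and is installed on PATH.
    return shutil.which("wkhtmltopdf") is not None


if not wkhtmltopdf_available():
    print("wkhtmltopdf not found; install it before saving pages as pdf")
```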

Usage

usage: scrape.py [-h] [-a [ATTRIBUTES [ATTRIBUTES ...]]] [-all]
                 [-c [CRAWL [CRAWL ...]]] [-C] [--csv] [-cs [CACHE_SIZE]]
                 [-f [FILTER [FILTER ...]]] [--html] [-i] [-m]
                 [-max MAX_CRAWLS] [-n] [-ni] [-no] [-o [OUT [OUT ...]]] [-ow]
                 [-p] [-pt] [-q] [-s] [-t] [-v] [-x [XPATH]]
                 [QUERY [QUERY ...]]

a command-line web scraping tool

positional arguments:
  QUERY                 URLs/files to scrape

optional arguments:
  -h, --help            show this help message and exit
  -a [ATTRIBUTES [ATTRIBUTES ...]], --attributes [ATTRIBUTES [ATTRIBUTES ...]]
                        extract text using tag attributes
  -all, --crawl-all     crawl all pages
  -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                        regexp rules for following new pages
  -C, --clear-cache     clear requests cache
  --csv                 write files as csv
  -cs [CACHE_SIZE], --cache-size [CACHE_SIZE]
                        size of page cache (default: 1000)
  -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                        regexp rules for filtering text
  --html                write files as HTML
  -i, --images          save page images
  -m, --multiple        save to multiple files
  -max MAX_CRAWLS, --max-crawls MAX_CRAWLS
                        max number of pages to crawl
  -n, --nonstrict       allow crawler to visit any domain
  -ni, --no-images      do not save page images
  -no, --no-overwrite   do not overwrite files if they exist
  -o [OUT [OUT ...]], --out [OUT [OUT ...]]
                        specify outfile names
  -ow, --overwrite      overwrite a file if it exists
  -p, --pdf             write files as pdf
  -pt, --print          print text output
  -q, --quiet           suppress program output
  -s, --single          save to a single file
  -t, --text            write files as text
  -v, --version         display current version
  -x [XPATH], --xpath [XPATH]
                        filter HTML using XPath
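
As a rough illustration of how these options combine, the following Python sketch assembles an argument list using only option names from the usage text above. The helper `build_scrape_args` is hypothetical, and it assumes the installed command is invoked as `scrape`. The queries are placed before `--crawl` and `--filter` because those options accept a variable number of values and could otherwise absorb the positional arguments.

```python
# Output-format flags, taken from the usage text.
FORMAT_FLAGS = {"text": "--text", "csv": "--csv", "pdf": "--pdf", "html": "--html"}


def build_scrape_args(queries, crawl=(), filters=(), max_crawls=None,
                      single=False, fmt=None, program="scrape"):
    """Assemble a command line for scrape (hypothetical helper).

    `program` defaults to "scrape", which is an assumption about how the
    tool is installed; adjust it if you run scrape.py directly.
    """
    # Positional QUERY items go first so multi-value options cannot swallow them.
    args = [program, *queries]
    if crawl:
        args += ["--crawl", *crawl]
    if filters:
        args += ["--filter", *filters]
    if max_crawls is not None:
        args += ["--max-crawls", str(max_crawls)]
    if single:
        args.append("--single")
    if fmt is not None:
        args.append(FORMAT_FLAGS[fmt])
    return args


cmd = build_scrape_args(
    ["https://example.com"],  # placeholder URL
    crawl=["docs"],
    max_crawls=5,
    single=True,
    fmt="text",
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd)` on a machine where the tool is installed.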

Author

Hunter Hammond (huntrar@gmail.com)

Notes

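Supports both Python 2.x and Python 3.x.
Input to scrape can be links, files, or a combination of the two, allowing you to create new files from both existing and newly scraped content.
Multiple input files/URLs are saved to multiple output files/directories by default. To consolidate them, use the --single flag.
Images are included automatically when saving as pdf or HTML; this involves making additional HTTP requests, which adds significant processing time. To forgo this feature, use the --no-images flag or set the environment variable SCRAPE_DISABLE_IMGS.
A requests cache is enabled by default to cache web pages. It can be disabled by setting the environment variable SCRAPE_DISABLE_CACHE.
During processing, pages are saved temporarily as PART.html files. Unless pages are being saved as HTML, these files are removed automatically upon conversion or exit.
To crawl pages with no restrictions, use the --crawl-all flag, or restrict which pages are crawled by URL keyword by passing one or more regexps to --crawl.
If you want the crawler to follow links outside the given URL's domain, use --nonstrict.
Crawling can be stopped with Ctrl-C, or by setting the number of pages or links to crawl with --maxpages and --maxlinks; the usage text above lists only -max/--max-crawls for this limit. A page may contain zero or more links to further pages.
The text output of scraped files can be printed to stdout rather than saved by passing --print.
HTML can be filtered using --xpath, while text is filtered by passing one or more regexps to --filter.
If you only want to specify particular tag attributes to extract rather than an entire XPath, use --attributes. The default is to extract only text attributes, but you can specify one or more different attributes (such as href, src, title, or any available attribute).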
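
The crawl rules described in these notes can be pictured with a small Python sketch. It is an illustration under stated assumptions, not scrape's actual logic: the function `should_follow` is hypothetical, "same domain" is taken to mean an identical network location in the URL, and a crawl regexp is treated as matching if it is found anywhere in the link.

```python
import re
from urllib.parse import urlparse


def should_follow(link, start_url, crawl_patterns=(), crawl_all=False,
                  nonstrict=False):
    """Decide whether a crawler would follow `link` (illustrative rules only)."""
    # Strict mode: stay on the start URL's domain unless nonstrict is set.
    # Assumption: domains are compared as exact netloc strings.
    if not nonstrict and urlparse(link).netloc != urlparse(start_url).netloc:
        return False
    # Crawl-all mode: every remaining link is followed.
    if crawl_all:
        return True
    # Otherwise the link must match at least one crawl regexp.
    return any(re.search(p, link) for p in crawl_patterns)


start = "https://example.com"  # placeholder start URL
print(should_follow("https://example.com/docs/a", start, ["docs"]))
print(should_follow("https://other.org/docs", start, ["docs"]))
print(should_follow("https://other.org/docs", start, ["docs"], nonstrict=True))
```

In this sketch a link is never followed when neither crawl patterns nor crawl-all are given, which mirrors the idea that crawling is opt-in.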

scrape 0.9.15
