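Python web-trawler package - PyPI

Trawl web pages for files to download

web-trawler project description

Given the URL of an HTML web page, this Python package asynchronously downloads all non-web files linked to from that page, such as audio files and Excel documents. Optionally, all web pages linked to from the original page can be trawled for files as well.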

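Installation

Python 3 must be installed and on the system path. That is, it must be a recognized command in the command line interface. Run python --version on the command line to check that Python 3 is installed.

The Python package manager pip is also required. Check that you have it by running pip --version. It is installed automatically with recent versions of Python, but can also be installed manually. See the official installation instructions.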
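
The two checks above can also be made from inside Python. The following is a minimal sketch using only the standard library; the helper names python3_available and pip_available are made up for this illustration and are not part of web_trawler.

```python
import importlib.util
import sys


def python3_available():
    """Return True when the running interpreter is Python 3 or newer."""
    return sys.version_info >= (3,)


def pip_available():
    """Return True when the pip module can be located by this interpreter."""
    return importlib.util.find_spec("pip") is not None


print("Python 3:", python3_available())
print("pip found:", pip_available())
```

Note that this only inspects the interpreter that runs the script, which may differ from the python command found on the system path.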

Installing the web_trawler package

Run the following in a command line interface (leave out the $, which is only a prompt symbol):

$ pip install web_trawler --upgrade

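The package has no external dependencies. For testing, pytest is required.

The source code for web_trawler is available on gitlab.com.

Usage

Command line

Once installed, web_trawler can be used like this: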

$ web_trawler google.com

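Run this command to see how web_trawler finds links and checks their HTTP headers for more information. A series of log events is printed to the console. There are normally no files linked to from google.com, but if there are, they will be downloaded to download/, relative to the directory in which the command was run.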
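
To make the "finds links" step concrete, here is a minimal sketch of collecting links from a page using only the Python standard library. It is not web_trawler's actual implementation: the class LinkCollector, the function collect_links and the sample HTML are all invented for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from the <a> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # URL of the <a> tag currently being read, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html, base_url):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content, used instead of a real download.
sample = ('<p><a href="/files/data.xlsx">Data</a> '
          '<a href="http://example.org/about">About</a></p>')
print(collect_links(sample, "http://example.com/page"))
```

A real trawler would then request the headers of each collected URL to decide whether it points to a file or to another web page.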

The url argument is required. The following optional arguments are also supported:

--target TARGET
    Give a path for where you would like the files to be downloaded. The default path is "download".
--add_links_from_linked_pages
    Set web_trawler to trawl pages linked to from the original web page as well (only goes one step, and only for links within the domain of the original web page).
--interactive
    Short version is "-i". Asks the user whether or not to trawl each linked page (has no effect unless the --add_links_from_linked_pages flag is set).
--interactive_download_prompt
    Short version is "-I". Asks the user whether or not to download each of the files found.
--quiet
    Suppresses output information about which links are being processed and which files are being downloaded.
--processes PROCESSES
    Manually set how many processes will be spawned. The default is to spawn one less than the number of processors detected (so as not to stall the system). For each process, up to 10 threads are spawned.
--whitelist WHITELIST
    Space-separated file endings to whitelist. Allows use of wildcards, e.g. "xls*" to capture all the Excel file extension variants, like xlsx, xlsb, xlsm and xls. A given blacklist takes precedence over the whitelist.
--blacklist BLACKLIST
    Space-separated file endings to blacklist. Works just like the whitelist, only it excludes files with the given file endings.
--no_of_files_limit LIMIT
    Set a maximum number of files you are willing to download, in case web_trawler finds more than expected.
--mb_per_file_limit LIMIT
    Set a maximum file size you are willing to download. Warnings are logged to the console for each file excluded.
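
The --processes description implies a simple sizing rule: one process fewer than the detected processors, with up to 10 threads per process. The sketch below illustrates that rule with the standard library. It is not the package's own code: the function names are made up, the lower bound of one process is an assumption, and the URLs are placeholders that are never requested.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS_PER_PROCESS = 10  # figure taken from the option description


def default_process_count():
    """One less than the detected processor count, never fewer than one (assumed)."""
    detected = os.cpu_count() or 1
    return max(1, detected - 1)


def split_into_chunks(items, n_chunks):
    """Deal items round-robin into n_chunks lists, one per worker process."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def process_chunk(chunk, work):
    """Run work() on each item using at most MAX_THREADS_PER_PROCESS threads."""
    with ThreadPoolExecutor(max_workers=MAX_THREADS_PER_PROCESS) as pool:
        return list(pool.map(work, chunk))


# Placeholder URLs; len() stands in for the real per-link work.
urls = ["http://example.com/file%d.zip" % i for i in range(25)]
chunks = split_into_chunks(urls, default_process_count())
print(len(chunks), "chunk(s); first results:", process_chunk(chunks[0], len)[:3])
```

In a multi-process setup each chunk would be handed to its own process, for example through multiprocessing.Pool. Only the per-process thread pool is exercised here, to keep the example short.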

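Every argument has a shorthand consisting of its first letter, e.g. -t, -a, -q.

Practical usage example

If we want to download the zip and Excel files linked to from a web page on the World Input-Output Database site into a local directory named "data", we need the arguments -t (for target), -w (for whitelist) and -m (for the MB limit per file):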

$ web_trawler http://www.wiod.org/database/wiots16 -t "data" -w "zip xls*" -m 100

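Note the use of the wildcard in the whitelist. The web page links to files with two different Excel file endings, and the wildcard ensures that both are captured.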
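
The wildcard behaviour can be illustrated with the standard library's fnmatch module. This is a sketch of the documented rules (wildcards in file endings, blacklist taking precedence over whitelist, patterns given as a space-separated string or a list), not web_trawler's own filtering code. The function names and example URLs are invented for the illustration.

```python
import posixpath
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def as_patterns(value):
    """Accept None, a space-separated string or a list of file endings."""
    if value is None:
        return []
    if isinstance(value, str):
        return value.lower().split()
    return [pattern.lower() for pattern in value]


def file_ending(url):
    """Return the lower-cased file ending of a URL's path, without the dot."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lstrip(".").lower()


def is_allowed(url, whitelist=None, blacklist=None):
    """Apply the blacklist first, then the whitelist (if one is given)."""
    ending = file_ending(url)
    if any(fnmatchcase(ending, p) for p in as_patterns(blacklist)):
        return False
    allowed = as_patterns(whitelist)
    return not allowed or any(fnmatchcase(ending, p) for p in allowed)


for name in ("a.xls", "b.xlsx", "c.zip", "d.pdf"):
    url = "http://example.com/" + name
    print(name, is_allowed(url, whitelist="zip xls*"))
```

With the whitelist "zip xls*", the first three example files are accepted and the PDF is rejected. Adding blacklist="xlsx" would exclude b.xlsx even though the whitelist matches it.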

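If you test this command, it will start downloading a bunch of large files. Press ctrl-c or ctrl-z to interrupt or force-quit the process, respectively.

Make sure to clean up any unwanted downloaded files. They should be located in the target directory, relative to where the command was run. If no target is specified, they are downloaded to a directory named "download".

Including links from linked pages

To see how the -a argument works without starting a million downloads, run the following command, where -m 0 ensures that all files are skipped: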

$ web_trawler http://www.wiod.org/database/wiots16 -a -m 0

Note that the target directory is still created if it does not already exist.
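
The -a option is described as following links only one step and only within the domain of the original page. A minimal sketch of such a same-domain check follows. It assumes that "domain" means the exact host part of the URL, which may not be how web_trawler decides. The function names and the link list are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_domain(url, origin):
    """Compare the host parts of two URLs, ignoring case."""
    return urlparse(url).netloc.lower() == urlparse(origin).netloc.lower()


def pages_to_follow(origin, hrefs):
    """Return absolute same-domain URLs from hrefs, without duplicates."""
    followed = []
    for href in hrefs:
        absolute = urljoin(origin, href)  # resolve relative links
        if (same_domain(absolute, origin)
                and absolute != origin
                and absolute not in followed):
            followed.append(absolute)
    return followed


origin = "http://www.wiod.org/database/wiots16"
# Hypothetical links, as they might appear in href attributes on the page.
hrefs = ["/home", "http://www.wiod.org/release16",
         "https://example.com/elsewhere", "/home"]
print(pages_to_follow(origin, hrefs))
```

Going only one step means the pages returned here would be searched for files, but the links found on them would not be followed any further.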

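To be prompted about whether to add the files linked to from each linked page, run this command, in which the -a and -i options are joined into one and the whitelist is set so that no downloads are started: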

$ web_trawler http://www.wiod.org/database/wiots16 -ai -w "nosuchfileending"

Usage in Python

The code below does the same as the command line example above that uses -a and -m 0:

import web_trawler

web_trawler.trawl("http://www.wiod.org/database/wiots16",
                  add_links_from_linked_pages=True, mb_per_file_limit=0)

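The function trawl does the same as web_trawler run from the command line, but with the arguments passed to it directly in Python.

A couple of the intermediate functions used in web_trawler are also accessible from Python, either to list information about all the links on a web page, or to list only the links to files, filtered by a blacklist or whitelist. Here is a brief description of them:

get_links: Takes only one argument, a url, and returns a list of Link namedtuples, described below. This list is unfiltered. All http links that respond to an http request are included.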
get_file_links: Runs get_links and returns a filtered list of Link namedtuples for files only, with whitelist and/or blacklist applied if specified. Arguments have self-explanatory names. The whitelist and blacklist can be provided as a space-separated string or as a list.

get_links and get_file_links return a list of namedtuples with the following fields:

href: the link url
title: the content of the <a> tag containing the link
mb: calculated from the http header content-length
type: the http header content-type, unmodified
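
The four fields above can be pictured with a stand-in namedtuple. The sketch below does not import web_trawler. It defines its own Link type with the same field names, and assumes that mb is the content-length in bytes divided by 1,000,000 (the package may use a different divisor) and that header names arrive in lower case. The URLs and header values are made up.

```python
from collections import namedtuple

# Stand-in with the documented field names; not imported from web_trawler.
Link = namedtuple("Link", ["href", "title", "mb", "type"])


def make_link(href, title, headers):
    """Build a Link from a URL, its link text and a dict of HTTP headers."""
    length = headers.get("content-length")
    # Assumed conversion: bytes to megabytes using 1 MB = 1,000,000 bytes.
    mb = int(length) / 1_000_000 if length is not None else None
    return Link(href, title, mb, headers.get("content-type"))


links = [
    make_link("http://example.com/a.zip", "Archive",
              {"content-length": "2500000", "content-type": "application/zip"}),
    make_link("http://example.com/b.bin", "Big file",
              {"content-length": "150000000",
               "content-type": "application/octet-stream"}),
]

limit_mb = 100  # plays the role of --mb_per_file_limit
small_enough = [link for link in links
                if link.mb is not None and link.mb <= limit_mb]
for link in small_enough:
    print(link.href, link.mb, link.type)
```

Because the results are namedtuples, fields can be read by name (link.mb) or unpacked by position, which makes this kind of filtering on size or content type straightforward.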

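Usage in Matlab

In Matlab, functions from pip-installed Python packages can be called using the py. prefix, with optional arguments specified using the pyargs function: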

>>py.web_trawler.get_file_links('http://www.wiod.org/database/wiots16',pyargs('whitelist','xls* doc*'))

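stdout is not displayed, which is why the get_file_links function was chosen for this example, since it returns something. To use the full functionality of web_trawler, run the trawl function. As long as there are no errors, nothing is shown in the command window, but the files are still downloaded, relative to your current folder in Matlab.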

web-trawler 0.2.0