# thredds_crawler

[![Build Status](https://travis-ci.org/ioos/thredds_crawler.svg?branch=master)](https://travis-ci.org/ioos/thredds_crawler)

A simple crawler/parser for THREDDS catalogs

## Installation

```bash
pip install thredds_crawler
```

or

```bash
conda install -c conda-forge thredds_crawler
```

## Usage

### Select

You can select datasets based on their THREDDS ID using the `select` parameter. Python regex is supported.

```python
from thredds_crawler.crawl import Crawl

c = Crawl('http://tds.maracoos.org/thredds/MODIS.xml', select=[".*-Agg"])
print(c.datasets)
# [
#   <LeafDataset id: MODIS-Agg, name: MODIS-Complete Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2009-Agg, name: MODIS-2009 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2010-Agg, name: MODIS-2010 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2011-Agg, name: MODIS-2011 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2012-Agg, name: MODIS-2012 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2013-Agg, name: MODIS-2013 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-One-Agg, name: 1-Day-Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-Three-Agg, name: 3-Day-Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-Seven-Agg, name: 7-Day-Aggregation, services: ['OPENDAP', 'ISO']>
# ]
```
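
To see how a `select` pattern behaves before running a crawl, the matching can be tried offline with Python's `re` module. The sketch below assumes patterns are matched from the start of the dataset ID, as `re.match` does; the helper `select_ids` and the sample IDs are illustrative and not part of thredds_crawler.

```python
import re


def select_ids(ids, patterns):
    """Return the IDs matched by at least one pattern (hypothetical helper)."""
    compiled = [re.compile(p) for p in patterns]
    # re.match anchors at the start of the string, so ".*-Agg" is needed
    # to match an ID that merely ends in "-Agg".
    return [i for i in ids if any(c.match(i) for c in compiled)]


sample_ids = ["MODIS-Agg", "MODIS-2009-Agg", "MODIS-One-Agg", "MODIS-2009-001"]
selected = select_ids(sample_ids, [".*-Agg"])
print(selected)
```

Note that a pattern such as `*-Agg` is not a valid regular expression, because there is nothing for `*` to repeat; the leading `.` is required.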

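### Skip

Datasets are skipped based on their name and catalogRefs based on their title. By default, the crawler uses some common regular expressions to skip lists of thousands upon thousands of individual files that are part of aggregations or FMRCs:

* `.*files.*`
* `.*Individual Files.*`
* `.*File_Access.*`
* `.*Forecast Model Run.*`
* `.*Constant Forecast Offset.*`
* `.*Constant Forecast Date.*`

By setting the `skip` parameter to anything other than a superset of the defaults, you run the risk of having some angry system administrators after you.

The default skip list is available through the `Crawl.SKIPS` class variable:

```python
from thredds_crawler.crawl import Crawl

print(Crawl.SKIPS)
# [
#   '.*files.*',
#   '.*Individual Files.*',
#   '.*File_Access.*',
#   '.*Forecast Model Run.*',
#   '.*Constant Forecast Offset.*',
#   '.*Constant Forecast Date.*'
# ]
```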

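If you need to remove or add a `skip` pattern, it is **strongly** encouraged that you use the `SKIPS` class variable as a starting point!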

```python
from thredds_crawler.crawl import Crawl

# Extend the defaults rather than replacing them.
skips = Crawl.SKIPS + [".*-Day-Aggregation"]
c = Crawl(
    'http://tds.maracoos.org/thredds/MODIS.xml',
    select=[".*-Agg"],
    skip=skips
)
print(c.datasets)
# [
#   <LeafDataset id: MODIS-Agg, name: MODIS-Complete Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2009-Agg, name: MODIS-2009 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2010-Agg, name: MODIS-2010 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2011-Agg, name: MODIS-2011 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2012-Agg, name: MODIS-2012 Aggregation, services: ['OPENDAP', 'ISO']>,
#   <LeafDataset id: MODIS-2013-Agg, name: MODIS-2013 Aggregation, services: ['OPENDAP', 'ISO']>
# ]
```

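
The effect of extending the skip list can be checked offline before crawling. This is a minimal sketch, assuming skip patterns are tested against names and titles with `re.match`; `DEFAULT_SKIPS` copies the default patterns listed in this README, and `is_skipped` is a hypothetical helper rather than the library's own code.

```python
import re

# Copy of the default patterns listed in this README (assumed unchanged).
DEFAULT_SKIPS = [
    ".*files.*",
    ".*Individual Files.*",
    ".*File_Access.*",
    ".*Forecast Model Run.*",
    ".*Constant Forecast Offset.*",
    ".*Constant Forecast Date.*",
]


def is_skipped(title, patterns):
    """True if any skip pattern matches the title (hypothetical helper)."""
    return any(re.match(p, title) for p in patterns)


# Extend the defaults instead of replacing them.
skips = DEFAULT_SKIPS + [".*-Day-Aggregation"]

titles = ["MODIS Individual Files", "1-Day-Aggregation", "MODIS-2009 Aggregation"]
kept = [t for t in titles if not is_skipped(t, skips)]
print(kept)
```

Because the list is built by concatenation, the defaults stay in force and only the extra pattern changes what is kept.
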
### Workers

By default, the crawl uses 4 worker threads. You can change this by specifying the `workers` parameter.

```python
import time
from contextlib import contextmanager
from thredds_crawler.crawl import Crawl


@contextmanager
def timeit(name):
    # Time the body of the `with` block and report it in milliseconds.
    start_time = time.time()
    yield
    elapsed_time = time.time() - start_time
    print('[{}] finished in {} ms'.format(name, int(elapsed_time * 1000)))


for x in range(1, 11):
    with timeit('{} workers'.format(x)):
        Crawl("http://tds.maracoos.org/thredds/MODIS.xml", workers=x)

# [1 workers] finished in 872 ms
# [2 workers] finished in 397 ms
# [3 workers] finished in 329 ms
# [4 workers] finished in 260 ms
# [5 workers] finished in 264 ms
# [6 workers] finished in 219 ms
# [7 workers] finished in 212 ms
# [8 workers] finished in 185 ms
# [9 workers] finished in 217 ms
# [10 workers] finished in 205 ms
```
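
The timings above level off after a handful of workers, which is what one would expect if the crawl spends most of its time waiting on network requests. The sketch below illustrates that idea with the standard library's `ThreadPoolExecutor` and a simulated fetch. It is not how thredds_crawler is implemented; `fake_fetch`, `crawl_all`, and the URLs are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for an HTTP request: sleep briefly, then return the URL."""
    time.sleep(0.05)
    return url


urls = ["http://example.com/catalog/{}.xml".format(i) for i in range(8)]


def crawl_all(workers):
    """Fetch every URL with the given number of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order regardless of completion order.
        return list(pool.map(fake_fetch, urls))


for workers in (1, 4):
    start = time.time()
    results = crawl_all(workers)
    elapsed_ms = int((time.time() - start) * 1000)
    print("[{} workers] fetched {} URLs in {} ms".format(workers, len(results), elapsed_ms))
```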

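### Modified Time

You can filter datasets by their modified time using the `before` and `after` parameters. Keep in mind that the modified time is only available for individual files hosted in THREDDS (not aggregations).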

```python
from datetime import datetime

import pytz
from thredds_crawler.crawl import Crawl

bf = datetime(2016, 1, 5, 0, 0)
af = datetime(2015, 12, 30, 0, 0, tzinfo=pytz.utc)
url = 'http://tds.maracoos.org/thredds/catalog/MODIS-Chesapeake-Salinity/raw/2016/catalog.xml'

# after
c = Crawl(url, after=af)
print(len(c.datasets))

# before
c = Crawl(url, before=bf)
print(len(c.datasets))

# both
af = datetime(2016, 1, 20, 0, 0)
bf = datetime(2016, 2, 1, 0, 0)
c = Crawl(url, before=bf, after=af)
assert len(c.datasets) == 11
```
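
The example above mixes a naive datetime with a timezone-aware one. Python refuses to compare the two directly, so any filter of this kind has to normalize them first. Here is a minimal sketch of such a filter, assuming naive values are treated as UTC (an assumption, not documented behaviour); `as_utc`, `in_window`, and the sample file list are hypothetical.

```python
from datetime import datetime, timezone


def as_utc(dt):
    """Treat naive datetimes as UTC; convert aware ones to UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def in_window(modified, before=None, after=None):
    """True if `modified` lies strictly inside the optional (after, before) window."""
    m = as_utc(modified)
    if before is not None and not m < as_utc(before):
        return False
    if after is not None and not m > as_utc(after):
        return False
    return True


files = {
    "a.nc": datetime(2015, 12, 29, tzinfo=timezone.utc),
    "b.nc": datetime(2016, 1, 2),  # naive, assumed UTC
    "c.nc": datetime(2016, 1, 10, tzinfo=timezone.utc),
}
bf = datetime(2016, 1, 5)
af = datetime(2015, 12, 30, tzinfo=timezone.utc)
matches = sorted(n for n, m in files.items() if in_window(m, before=bf, after=af))
print(matches)
```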

### Authentication

The `auth` parameter needs to be a [requests compatible auth object](http://docs.python-requests.org/en/latest/user/authentication/).

```python
from thredds_crawler.crawl import Crawl

auth = ('user', 'password')
c = Crawl(
    'http://tds.maracoos.org/thredds/MODIS.xml',
    select=['.*-Agg'],
    skip=Crawl.SKIPS,
    auth=auth
)
```

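### Debugging

Pass `debug=True` to `Crawl` to log what is happening during the crawl.

```python
from thredds_crawler.crawl import Crawl

skips = Crawl.SKIPS + [".*-Day-Aggregation"]
c = Crawl(
    'http://tds.maracoos.org/thredds/MODIS.xml',
    select=['.*-Agg'],
    skip=skips,
    debug=True
)

# Crawling: http://tds.maracoos.org/thredds/MODIS.xml
# Skipping catalogRef based on 'skips'.  Title: MODIS Individual Files
# Skipping catalogRef based on 'skips'.  Title: 1-Day Individual Files
# Skipping catalogRef based on 'skips'.  Title: 3-Day Individual Files
# Skipping catalogRef based on 'skips'.  Title: 8-Day Individual Files
# Processing MODIS-Agg
# Processing MODIS-2009-Agg
# Processing MODIS-2010-Agg
# Processing MODIS-2011-Agg
# Processing MODIS-2012-Agg
# Processing MODIS-2013-Agg
# Skipping dataset based on 'skips'.  Name: 1-Day-Aggregation
```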

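### Logging

To control the crawler's logging yourself, access the named logger. In that case, **do not** include `debug=True` when initializing the `Crawl` object.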

```python
import logging

crawl_log = logging.getLogger('thredds_crawler')
crawl_log.setLevel(logging.WARNING)
```

## Dataset

You can get some basic information about a LeafDataset, including the services available.

```python
from thredds_crawler.crawl import Crawl

c = Crawl('http://tds.maracoos.org/thredds/MODIS.xml', select=['.*-Agg'])
dataset = c.datasets[0]
print(dataset.id)
# MODIS-Agg
print(dataset.name)
# MODIS-Complete Aggregation
print(dataset.services)
# [
#   {
#     'url': 'http://tds.maracoos.org/thredds/dodsC/MODIS-Agg.nc',
#     'name': 'odap',
#     'service': 'OPENDAP'
#   },
#   {
#     'url': 'http://tds.maracoos.org/thredds/iso/MODIS-Agg.nc',
#     'name': 'iso',
#     'service': 'ISO'
#   }
# ]
```

If you have a list of datasets, you can return all endpoints of a certain type:

```python
from thredds_crawler.crawl import Crawl

c = Crawl('http://tds.maracoos.org/thredds/MODIS.xml', select=['.*-Agg'])
urls = [s.get("url") for d in c.datasets for s in d.services if s.get("service").lower() == "opendap"]
print(urls)
# [
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-Agg.nc',
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-2009-Agg.nc',
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-2010-Agg.nc',
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-2011-Agg.nc',
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-2012-Agg.nc',
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-2013-Agg.nc',
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-One-Agg.nc',
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-Three-Agg.nc',
#   'http://tds.maracoos.org/thredds/dodsC/MODIS-Seven-Agg.nc'
# ]
```
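
The same comprehension can be wrapped in a small helper when several service types are needed. The sketch below runs on plain dictionaries shaped like the `services` entries shown earlier, so it does not need a live server; `endpoints`, `FakeDataset`, and the example.com URLs are hypothetical.

```python
from collections import namedtuple

# Minimal stand-in for a dataset object: only the attributes used here.
FakeDataset = namedtuple("FakeDataset", ["id", "services"])


def endpoints(datasets, service_type):
    """Collect the URLs of every service of the given type (case-insensitive)."""
    wanted = service_type.lower()
    return [
        s.get("url")
        for d in datasets
        for s in d.services
        if s.get("service", "").lower() == wanted
    ]


datasets = [
    FakeDataset("A", [
        {"url": "http://example.com/dodsC/A.nc", "name": "odap", "service": "OPENDAP"},
        {"url": "http://example.com/iso/A.nc", "name": "iso", "service": "ISO"},
    ]),
    FakeDataset("B", [
        {"url": "http://example.com/dodsC/B.nc", "name": "odap", "service": "OPENDAP"},
    ]),
]
print(endpoints(datasets, "opendap"))
print(endpoints(datasets, "iso"))
```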

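The `size` attribute of a dataset returns the size on disk if that information is available in the TDS catalog. If it is not available and a DAP endpoint is available, it returns the theoretical size of all of the variables.
This is not necessarily the size on disk, because it does not account for `missing_value` and `_FillValue` space.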

```python
from thredds_crawler.crawl import Crawl

c = Crawl(
    'http://thredds.axiomalaska.com/thredds/catalogs/cencoos.html',
    select=['MB_.*']
)
sizes = [d.size for d in c.datasets]
print(sizes)
# [29247.41028399998, 72166.28968000002]
```
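
A theoretical size of this kind is the number of array elements times the bytes per element, summed over all variables. Below is a rough sketch of that arithmetic, with made-up variable shapes and the assumption that the result is expressed in megabytes (this README does not state the unit); `theoretical_size_mb` is a hypothetical helper, not the library's method.

```python
from math import prod

# Hypothetical variables: (shape, bytes per element).
variables = {
    "sst": ((365, 1024, 2048), 4),  # float32
    "time": ((365,), 8),            # float64
}


def theoretical_size_mb(variables):
    """Sum of elements * itemsize over all variables, in megabytes (1024**2 bytes)."""
    total_bytes = sum(prod(shape) * itemsize for shape, itemsize in variables.values())
    return total_bytes / 1024 ** 2


size_mb = theoretical_size_mb(variables)
print(round(size_mb, 3))
```

Every element is counted, including ones that hold a fill value, so the figure can differ from the size on disk.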

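## Metadata

The THREDDS catalog metadata record is saved along with the dataset object. It is an etree Element object from which you can extract information. See the [THREDDS metadata spec](http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/v1.0.2/InvCatalogSpec.html#metadata).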

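```python
from thredds_crawler.crawl import Crawl

c = Crawl('http://tds.maracoos.org/thredds/MODIS.xml', select=['.*-Agg'])
dataset = c.datasets[0]
print(dataset.metadata.find("{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}documentation").text)
# Ocean Color data are provided as a service to the broader community, and can be influenced by sensor degradation and/or algorithm changes. We make efforts to keep this dataset updated and calibrated. The products in these files are experimental.
# Aggregations are simply a means of accessing the available data within a specified time range. Use at your own discretion.
```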
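
The `{namespace}tag` form is how ElementTree addresses namespaced elements. The sketch below shows the same kind of lookup on a small hand-written XML fragment using the standard library's `xml.etree.ElementTree`. The fragment is illustrative and far smaller than a real THREDDS metadata record.

```python
import xml.etree.ElementTree as ET

NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hand-written fragment for illustration only.
xml_text = """
<metadata xmlns="{ns}">
  <documentation type="summary">Example summary text.</documentation>
  <documentation type="rights">Example rights text.</documentation>
</metadata>
""".format(ns=NS)

metadata = ET.fromstring(xml_text.strip())

# find() returns the first matching child; findall() returns all of them.
first = metadata.find("{%s}documentation" % NS)
print(first.text)

by_type = {d.get("type"): d.text for d in metadata.findall("{%s}documentation" % NS)}
print(by_type["rights"])
```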

## Use Case

The script below crawls several THREDDS catalogs and saves the ISO metadata files to a local directory.

```python
import logging
import logging.handlers
import os
from urllib.request import urlretrieve

from thredds_crawler.crawl import Crawl

# Log to a rotating file and to the console.
logger = logging.getLogger('thredds_crawler')
fh = logging.handlers.RotatingFileHandler('/var/log/iso_harvest/iso_harvest.log', maxBytes=1024 * 1024 * 10, backupCount=5)
fh.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)
logger.addHandler(fh)
logger.addHandler(ch)
logger.setLevel(logging.DEBUG)

SAVE_DIR = "/srv/http/iso"

THREDDS_SERVERS = {
    "aoos": "http://thredds.axiomalaska.com/thredds/catalogs/aoos.html",
    "cencoos": "http://thredds.axiomalaska.com/thredds/catalogs/cencoos.html",
    "maracoos": "http://tds.maracoos.org/thredds/catalog.html",
    "glos": "http://tds.glos.us/thredds/catalog.html"
}

for subfolder, thredds_url in THREDDS_SERVERS.items():
    logger.info("Crawling %s (%s)" % (subfolder, thredds_url))
    # No debug=True here: logging is already configured through the named logger.
    crawler = Crawl(thredds_url)
    # Collect (dataset id, ISO endpoint URL) pairs.
    isos = [(d.id, s.get("url")) for d in crawler.datasets for s in d.services if s.get("service").lower() == "iso"]
    filefolder = os.path.join(SAVE_DIR, subfolder)
    if not os.path.exists(filefolder):
        os.makedirs(filefolder)
    for iso in isos:
        try:
            filename = iso[0].replace("/", "_") + ".iso.xml"
            filepath = os.path.join(filefolder, filename)
            logger.info("Downloading/Saving %s" % filepath)
            urlretrieve(iso[1], filepath)
        except Exception:
            logger.exception("Error!")
```


thredds_crawler 1.5.4
