Make downloading scientific data easier

Detailed description of the data-downloader Python project


Data Downloader


1. Installation

It is recommended to install data_downloader with the latest version of pip.

pip install data_downloader

2. downloader usage

All download functions are in data_downloader.downloader, so import downloader first:

from data_downloader import downloader

2.1 Netrc

If a website requires logging in, you can add a record with your login information to the .netrc file in your home directory to avoid providing the username and password every time you download data.

View existing hosts:

netrc = downloader.Netrc()
print(netrc.hosts)

Add a record:

netrc.add(host, login, password, account=None, overwrite=False)

To update an existing record, set the parameter overwrite=True.

For NASA data users:

netrc.add('urs.earthdata.nasa.gov', 'your_username', 'your_password')

If you don't know the hostname of the website:

host = downloader.get_url_host(url)
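get_url_host simply returns the network location of a URL. A stand-in built on the standard library behaves the same way for the hosts used in this document (a sketch, not the packaged implementation):

```python
# Standard-library stand-in for downloader.get_url_host (a sketch; the
# packaged implementation may differ in edge cases such as ports).
from urllib.parse import urlparse

def get_url_host(url):
    # netloc is the host part stored as "machine" in .netrc
    return urlparse(url).netloc

print(get_url_host('https://urs.earthdata.nasa.gov/profile'))
# prints: urs.earthdata.nasa.gov
```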

Remove a record:

netrc.remove(host)

Clear all records:

netrc.clear()

Example:

In [2]: netrc = downloader.Netrc()

In [3]: netrc.hosts
Out[3]: {}

In [4]: netrc.add('urs.earthdata.nasa.gov', 'username', 'passwd')

In [5]: netrc.hosts
Out[5]: {'urs.earthdata.nasa.gov': ('username', None, 'passwd')}

In [6]: netrc
Out[6]:
machine urs.earthdata.nasa.gov
    login username
    password passwd

# This command only for linux user
In [7]: !cat ~/.netrc
machine urs.earthdata.nasa.gov
    login username
    password passwd

In [8]: url = 'https://gpm1.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FGPM_L3%2FGPM_3IMERGM.06%2F2000%2F3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5&FORMAT=bmM0Lw&BBOX=31.904%2C99.492%2C35.771%2C105.908&LABEL=3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5.SUB.nc4&SHORTNAME=GPM_3IMERGM&SERVICE=L34RS_GPM&VERSION=1.02&DATASET_VERSION=06&VARIABLES=precipitation'

In [9]: downloader.get_url_host(url)
Out[9]: 'gpm1.gesdisc.eosdis.nasa.gov'

In [10]: netrc.add(downloader.get_url_host(url), 'username', 'passwd')

In [11]: netrc
Out[11]:
machine urs.earthdata.nasa.gov
    login username
    password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
    login username
    password passwd

In [12]: netrc.add(downloader.get_url_host(url), 'username', 'new_passwd')
>>> Warning: test_host existed, nothing will be done. If you want to overwrite the existed record, set overwrite=True

In [13]: netrc
Out[13]:
machine urs.earthdata.nasa.gov
    login username
    password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
    login username
    password passwd

In [14]: netrc.add(downloader.get_url_host(url), 'username', 'new_passwd', overwrite=True)

In [15]: netrc
Out[15]:
machine urs.earthdata.nasa.gov
    login username
    password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
    login username
    password new_passwd

In [16]: netrc.remove(downloader.get_url_host(url))

In [17]: netrc
Out[17]:
machine urs.earthdata.nasa.gov
    login username
    password passwd

In [18]: netrc.clear()

In [19]: netrc.hosts
Out[19]: {}

2.2 download_data

This function is designed for downloading a single file. If you have many files to download, try the download_datas, mp_download_datas or async_download_datas functions.

downloader.download_data(url, folder=None, file_name=None, client=None)

Parameters:

url: str
    url of web file
folder: str
    the folder to store output files. Default current folder. 
file_name: str
    the file name. If None, will parse from web response or url.
    file_name can be the absolute path if folder is None.
client: httpx.Client() object
    client maintaining connection. Default None

Example:

In [6]: url = 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif'
   ...:
   ...: folder = 'D:\\data'
   ...: downloader.download_data(url, folder)

20141117_20141211.geo.unw.tif:   2%|          | 455k/22.1M [00:52<42:59, 8.38kB/s]
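The `client` parameter lets several calls share one HTTP connection pool. A minimal sketch, assuming httpx and data_downloader are installed (the URLs are the LiCSAR interferograms used in the examples here; the folder is illustrative):

```python
# Sketch: pass one httpx.Client to several download_data calls so the
# files reuse a single connection pool. The call is wrapped in a
# function because it needs network access and the packages installed.
BASE = ('http://gws-access.ceda.ac.uk/public/nceo_geohazards/'
        'LiCSAR_products/106/106D_05049_131313/interferograms')
urls = [f'{BASE}/20141117_20141211/20141117_20141211.geo.unw.tif',
        f'{BASE}/20141024_20150221/20141024_20150221.geo.unw.tif']

def download_with_shared_client(urls, folder='D:\\data'):
    import httpx
    from data_downloader import downloader
    with httpx.Client() as client:  # one connection pool for every file
        for url in urls:
            downloader.download_data(url, folder, client=client)
```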

2.3 download_datas

Download data from a list-like object containing URLs. This function downloads the files one by one.

downloader.download_datas(urls, folder=None, file_names=None)

Parameters:

urls:  iterator
    iterator contains urls
folder: str
    the folder to store output files. Default current folder.
file_names: iterator
    iterator contains names of files. Leave it None if you want the program to parse
    them from the website. file_names can contain the absolute paths if folder is None.

Example:

In [12]: from data_downloader import downloader
    ...:
    ...: urls = ['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141211_20150128/20141211_20150128.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150317/20141117_20150317.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150221/20141117_20150221.geo.cc.tif']
    ...:
    ...: folder = 'D:\\data'
    ...: downloader.download_datas(urls, folder)

20141117_20141211.geo.unw.tif:   6%|          | 1.37M/22.1M [03:09<2:16:31, 2.53kB/s]

2.4 mp_download_datas

Download files simultaneously using multiprocessing. Websites that don't support resuming interrupted downloads may yield incomplete files; in that case, use download_datas instead.

downloader.mp_download_datas(urls, folder=None, file_names=None, ncore=None, desc='')

Parameters:

urls:  iterator
    iterator contains urls
folder: str
    the folder to store output files. Default current folder.
file_names: iterator
    iterator contains names of files. Leave it None if you want the program to parse
    them from the website. file_names can contain the absolute paths if folder is None.
ncore: int
    Number of cores for parallel downloading. If ncore is None, the number returned
    by os.cpu_count() is used. Default None.
desc: str
    description of datas downloading

Example:

In [12]: from data_downloader import downloader
    ...:
    ...: urls = ['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141211_20150128/20141211_20150128.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150317/20141117_20150317.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150221/20141117_20150221.geo.cc.tif']
    ...:
    ...: folder = 'D:\\data'
    ...: downloader.mp_download_datas(urls, folder)

>>> 12 parallel downloading
>>> Total | :   0%|          | 0/7 [00:00<?, ?it/s]
20141211_20150128.geo.cc.tif:  15%|██▊       | 803k/5.44M [00:00<?, ?B/s]

2.5 async_download_datas

Download files simultaneously in async mode. Websites that don't support resuming interrupted downloads may yield incomplete files; in that case, use download_datas instead.

downloader.async_download_datas(urls, folder=None, file_names=None, limit=30, desc='')

Parameters:

urls:  iterator
    iterator contains urls
folder: str 
    the folder to store output files. Default current folder.
file_names: iterator
    iterator contains names of files. Leave it None if you want the program
    to parse them from the website. file_names can contain the absolute paths if folder is None.
limit: int
    the number of files downloading simultaneously
desc: str
    description of datas downloading

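A sketch mirroring the mp_download_datas example above, with at most 5 files downloading at once (the folder and limit values are illustrative):

```python
# Sketch: download some of the LiCSAR interferograms from the examples
# above in async mode, 5 at a time. The call is wrapped in a function
# because it needs data_downloader installed and network access.
BASE = ('http://gws-access.ceda.ac.uk/public/nceo_geohazards/'
        'LiCSAR_products/106/106D_05049_131313/interferograms')
urls = [f'{BASE}/20141117_20141211/20141117_20141211.geo.unw.tif',
        f'{BASE}/20141024_20150221/20141024_20150221.geo.unw.tif',
        f'{BASE}/20141024_20150128/20141024_20150128.geo.cc.tif']

def fetch_async(urls, folder='D:\\data'):
    from data_downloader import downloader
    downloader.async_download_datas(urls, folder, limit=5,
                                    desc='LiCSAR interferograms')
```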

2.6 status_ok

Check simultaneously whether the given links are accessible.

status_ok(urls, limit=200, timeout=60)

Parameters:

urls: iterator
    iterator contains urls
limit: int
    the number of urls connecting simultaneously
timeout: int
    stop waiting for a response after the given number of seconds

Returns:

List of results (True or False)

Example:

In [1]: from data_downloader import downloader
   ...: import numpy as np
   ...:
   ...: urls = np.array(['https://www.baidu.com',
   ...: 'https://www.bai.com/wrongurl',
   ...: 'https://cn.bing.com/',
   ...: 'https://bing.com/wrongurl',
   ...: 'https://bing.com/'])
   ...:
   ...: status_ok = downloader.status_ok(urls)
   ...: urls_accessable = urls[status_ok]
   ...: print(urls_accessable)

['https://www.baidu.com' 'https://cn.bing.com/' 'https://bing.com/']

3. parse_urls usage

Provides simple ways to get URLs from various media.

To import:

from data_downloader import parse_urls

3.1 from_urls_file

Parse URLs from a file that contains only URLs.

parse_urls.from_urls_file(url_file)

Parameters:

url_file: str
    path to file which only contains urls 

Returns:

List containing URLs
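Such a file is plain text with one URL per line. For illustration, a minimal stand-in with the same behavior can be written with the standard library (a sketch; the packaged function may differ, e.g. in how it treats blank lines):

```python
# Minimal stand-in for parse_urls.from_urls_file: read one URL per line,
# skipping blank lines. (Sketch only; not the packaged implementation.)
def from_urls_file(url_file):
    with open(url_file) as f:
        return [line.strip() for line in f if line.strip()]
```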

3.2 from_sentinel_meta4

Parse URLs from the products.meta4 file of Sentinel products downloaded from https://scihub.copernicus.eu/dhus.

parse_urls.from_sentinel_meta4(url_file)

Parameters:

url_file: str
    path to products.meta4

Returns:

List containing URLs

3.3 from_html

Parse URLs from an HTML page.

parse_urls.from_html(url, suffix=None, suffix_depth=0, url_depth=0)


Returns:

List containing URLs

Example:

from data_downloader import parse_urls

url = 'https://cds-espri.ipsl.upmc.fr/espri/pubipsl/iasib_CH4_2014_uk.jsp'
urls = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1)
urls_all = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1, url_depth=1)

print(len(urls_all) - len(urls))

3.4 from_EarthExplorer_order

Parse URLs from orders in EarthExplorer.

Reference: bulk-downloader

parse_urls.from_EarthExplorer_order(username=None, passwd=None, email=None, order=None, url_host=None)

Parameters:

username, passwd: str, optional
    your username and passwd to login in EarthExplorer. Could be
    None when you have saved them in .netrc
email: str, optional
    email address for the user that submitted the order
order: str or dict
    which order to download. If None, all orders retrieved from 
    EarthExplorer will be used.
url_host: str
    if host is not USGS ESPA

Returns:

dict in the format {orderid: url}

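A sketch based on the signature above: with username and passwd left as None the credentials are read from ~/.netrc, and with order=None every retrieved order is used. The call is wrapped in a function because it needs network access and an EarthExplorer account:

```python
# Sketch: collect {orderid: url} for all EarthExplorer orders, with
# credentials taken from ~/.netrc. Requires data_downloader installed
# and a working EarthExplorer account; nothing runs at import time.
def get_order_urls():
    from data_downloader import parse_urls
    orders = parse_urls.from_EarthExplorer_order()
    for orderid, url in orders.items():
        print(orderid, url)
    return orders
```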
