data_downloader

Make downloading scientific data much easier
1. Installation

It is recommended to install data_downloader with the latest version of pip:

```shell
pip install data_downloader
```
2. downloader usage

All downloading functions are in data_downloader.downloader, so import downloader first:

```python
from data_downloader import downloader
```
2.1 Netrc

If a website requires logging in, you can add a record to the .netrc file in your home directory containing your login information, so you don't have to supply the username and password every time you download data.

View the existing hosts:

```python
netrc = downloader.Netrc()
print(netrc.hosts)
```
Add a record:

```python
netrc.add(self, host, login, password, account=None, overwrite=False)
```

If you want to update a record, set the parameter overwrite=True.
For NASA data users:

```python
netrc.add('urs.earthdata.nasa.gov', 'your_username', 'your_password')
```
If you don't know the host name of the website:

```python
host = downloader.get_url_host(url)
```
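Under the hood this presumably amounts to extracting the network-location part of the URL; a standard-library sketch (not the library's actual code):

```python
from urllib.parse import urlsplit

def get_url_host(url):
    """Return the host part of a URL (sketch of what downloader.get_url_host does)."""
    return urlsplit(url).netloc

print(get_url_host('https://gpm1.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=x'))
# gpm1.gesdisc.eosdis.nasa.gov
```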
Remove a record:

```python
netrc.remove(self, host)
```

Clear all records:

```python
netrc.clear()
```
Example:

```python
In [2]: netrc = downloader.Netrc()

In [3]: netrc.hosts
Out[3]: {}

In [4]: netrc.add('urs.earthdata.nasa.gov', 'username', 'passwd')

In [5]: netrc.hosts
Out[5]: {'urs.earthdata.nasa.gov': ('username', None, 'passwd')}

In [6]: netrc
Out[6]:
machine urs.earthdata.nasa.gov
    login username
    password passwd

# This command only for linux user
In [7]: !cat ~/.netrc
machine urs.earthdata.nasa.gov
    login username
    password passwd

In [8]: url = 'https://gpm1.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FGPM_L3%2FGPM_3IMERGM.06%2F2000%2F3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5&FORMAT=bmM0Lw&BBOX=31.904%2C99.492%2C35.771%2C105.908&LABEL=3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5.SUB.nc4&SHORTNAME=GPM_3IMERGM&SERVICE=L34RS_GPM&VERSION=1.02&DATASET_VERSION=06&VARIABLES=precipitation'

In [9]: downloader.get_url_host(url)
Out[9]: 'gpm1.gesdisc.eosdis.nasa.gov'

In [10]: netrc.add(downloader.get_url_host(url), 'username', 'passwd')

In [11]: netrc
Out[11]:
machine urs.earthdata.nasa.gov
    login username
    password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
    login username
    password passwd

In [12]: netrc.add(downloader.get_url_host(url), 'username', 'new_passwd')
>>> Warning: test_host existed, nothing will be done. If you want to overwrite the existed record, set overwrite=True

In [13]: netrc
Out[13]:
machine urs.earthdata.nasa.gov
    login username
    password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
    login username
    password passwd

In [14]: netrc.add(downloader.get_url_host(url), 'username', 'new_passwd', overwrite=True)

In [15]: netrc
Out[15]:
machine urs.earthdata.nasa.gov
    login username
    password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
    login username
    password new_passwd

In [16]: netrc.remove(downloader.get_url_host(url))

In [17]: netrc
Out[17]:
machine urs.earthdata.nasa.gov
    login username
    password passwd

In [18]: netrc.clear()

In [19]: netrc.hosts
Out[19]: {}
```
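For reference, the records above are ordinary netrc entries, and Python's standard `netrc` module parses the same format. A self-contained sketch using a temporary file (the host and credentials are the dummy values from the example):

```python
import netrc
import os
import tempfile

# A record in the same format that downloader.Netrc writes into ~/.netrc
record = "machine urs.earthdata.nasa.gov\nlogin username\npassword passwd\n"

with tempfile.NamedTemporaryFile('w', delete=False) as f:
    f.write(record)
    path = f.name

# Parse the file and look up the credentials for the host
auth = netrc.netrc(path).authenticators('urs.earthdata.nasa.gov')
os.remove(path)
print(auth[0], auth[2])  # username passwd
```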
2.2 download_data

This function is designed for downloading a single file. If you have many files to download, try download_datas, mp_download_datas or async_download_datas instead.

```python
downloader.download_data(url, folder=None, file_name=None, client=None)
```
Parameters:

url: str
    url of the web file
folder: str
    the folder to store output files. Default: current folder.
file_name: str
    the file name. If None, it will be parsed from the web response or the url.
    file_name can be an absolute path if folder is None.
client: httpx.Client() object
    client maintaining the connection. Default: None.
Example:

```python
In [6]: url = 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif'
   ...: folder = 'D:\\data'
   ...: downloader.download_data(url, folder)

20141117_20141211.geo.unw.tif:   2%|▌ | 455k/22.1M [00:52<42:59, 8.38kB/s]
```
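When `file_name` is None, the name is parsed from the web response or the url. The url fallback can be sketched with the standard library (`filename_from_url` is a hypothetical helper, not part of data_downloader):

```python
import posixpath
from urllib.parse import unquote, urlsplit

def filename_from_url(url):
    """Guess a file name from the last path segment of a URL (illustration only)."""
    path = unquote(urlsplit(url).path)
    return posixpath.basename(path)

url = ('http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/'
       '106/106D_05049_131313/interferograms/20141117_20141211/'
       '20141117_20141211.geo.unw.tif')
print(filename_from_url(url))  # 20141117_20141211.geo.unw.tif
```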
2.3 download_datas

Download data from a list-like object that contains URLs. This function downloads files one by one.

```python
downloader.download_datas(urls, folder=None, file_names=None)
```
Parameters:
urls: iterator
    iterator containing the urls
folder: str
    the folder to store output files. Default: current folder.
file_names: iterator
    iterator containing the file names. Leave it None if you want the program to parse
    them from the website. file_names can contain absolute paths if folder is None.
Example:

```python
In [12]: from data_downloader import downloader
    ...:
    ...: urls = ['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141211_20150128/20141211_20150128.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150317/20141117_20150317.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150221/20141117_20150221.geo.cc.tif']
    ...:
    ...: folder = 'D:\\data'
    ...: downloader.download_datas(urls, folder)

20141117_20141211.geo.unw.tif:   6%|█ | 1.37M/22.1M [03:09<2:16:31, 2.53kB/s]
```
2.4 mp_download_datas

Download files simultaneously using multiprocessing. Websites that do not support resuming from breakpoints may leave downloads incomplete; in that case, use download_datas instead.

```python
downloader.mp_download_datas(urls, folder=None, file_names=None, ncore=None, desc='')
```
Parameters:

urls: iterator
    iterator containing the urls
folder: str
    the folder to store output files. Default: current folder.
file_names: iterator
    iterator containing the file names. Leave it None if you want the program to parse
    them from the website. file_names can contain absolute paths if folder is None.
ncore: int
    number of cores for parallel downloading. If ncore is None, the number returned
    by os.cpu_count() is used. Default: None.
desc: str
    description of the downloading task
Example:

```python
In [12]: from data_downloader import downloader
    ...:
    ...: urls = ['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128.geo.unw.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141211_20150128/20141211_20150128.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150317/20141117_20150317.geo.cc.tif',
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150221/20141117_20150221.geo.cc.tif']
    ...:
    ...: folder = 'D:\\data'
    ...: downloader.mp_download_datas(urls, folder)

>>> 12 parallel downloading
>>> Total | :   0%| | 0/7 [00:00<?, ?it/s]
20141211_20150128.geo.cc.tif:  15%|██▊ | 803k/5.44M [00:00<?, ?B/s]
```
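The pattern behind parallel downloading can be sketched with a worker pool. The sketch below uses threads and a stubbed `fake_download` so it stays self-contained and offline; the library itself uses multiprocessing with `ncore` workers, defaulting to `os.cpu_count()`:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def fake_download(url):
    # Stand-in for downloading one file; just returns the "file name"
    return url.rsplit('/', 1)[-1]

urls = ['http://example.com/data/file_%d.tif' % i for i in range(7)]

ncore = os.cpu_count() or 1  # default worker count, as mp_download_datas does
with ThreadPoolExecutor(max_workers=ncore) as pool:
    # pool.map preserves the input order of urls
    names = list(pool.map(fake_download, urls))

print(names)
```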
2.5 async_download_datas

Download files simultaneously in asynchronous mode. Websites that do not support resuming from breakpoints may leave downloads incomplete; in that case, use download_datas instead.

```python
downloader.async_download_datas(urls, folder=None, file_names=None, limit=30, desc='')
```
Parameters:

urls: iterator
    iterator containing the urls
folder: str
    the folder to store output files. Default: current folder.
file_names: iterator
    iterator containing the file names. Leave it None if you want the program to parse
    them from the website. file_names can contain absolute paths if folder is None.
limit: int
    the number of files downloading simultaneously
desc: str
    description of the downloading task
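The `limit` parameter caps how many files are in flight at once. The semantics can be sketched with an `asyncio.Semaphore` and dummy tasks (an illustration of the pattern, not the library's code; the sleep stands in for a download):

```python
import asyncio

async def fetch_one(url, sem, stats):
    async with sem:                       # at most `limit` tasks may be inside
        stats['now'] += 1
        stats['peak'] = max(stats['peak'], stats['now'])
        await asyncio.sleep(0.01)         # stand-in for an actual download
        stats['now'] -= 1
        return url

async def fetch_all(urls, limit=30):
    sem = asyncio.Semaphore(limit)
    stats = {'now': 0, 'peak': 0}
    # gather preserves the order of urls in its results
    results = await asyncio.gather(*(fetch_one(u, sem, stats) for u in urls))
    return results, stats['peak']

urls = ['http://example.com/f%d.tif' % i for i in range(10)]
results, peak = asyncio.run(fetch_all(urls, limit=3))
print(peak)  # never exceeds limit (3)
```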
2.6 status_ok
Simultaneously check whether the given links are accessible.

```python
downloader.status_ok(urls, limit=200, timeout=60)
```
Parameters:
urls: iterator
    iterator containing the urls
limit: int
    the number of urls to check simultaneously
timeout: int
    stop waiting for a response after the given number of seconds
Returns:

List of results (True or False).
Example:

```python
In [1]: from data_downloader import downloader
   ...: import numpy as np
   ...:
   ...: urls = np.array(['https://www.baidu.com',
   ...:                  'https://www.bai.com/wrongurl',
   ...:                  'https://cn.bing.com/',
   ...:                  'https://bing.com/wrongurl',
   ...:                  'https://bing.com/'])
   ...:
   ...: status_ok = downloader.status_ok(urls)
   ...: urls_accessable = urls[status_ok]
   ...: print(urls_accessable)

['https://www.baidu.com' 'https://cn.bing.com/' 'https://bing.com/']
```
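The boolean-mask trick above works because the results come back in the same order as the input urls. That ordering contract can be sketched offline with a thread pool and a stubbed checker (`fake_check` and `status_ok_sketch` are hypothetical, not part of data_downloader):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_check(url):
    # Stand-in for an HTTP request; the real status_ok inspects the response status
    return 'wrongurl' not in url

def status_ok_sketch(urls, limit=200):
    workers = max(1, min(limit, len(urls)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so the result aligns with urls
        return list(pool.map(fake_check, urls))

urls = ['https://cn.bing.com/', 'https://bing.com/wrongurl', 'https://bing.com/']
print(status_ok_sketch(urls))  # [True, False, True]
```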
3. parse_urls usage

Provides simple ways to retrieve URLs from various sources.

To import:

```python
from data_downloader import parse_urls
```
3.1 from_urls_file

Parse URLs from a file that contains only URLs.

```python
parse_urls.from_urls_file(url_file)
```
Parameters:

url_file: str
    path to a file which only contains urls

Returns:

List of URLs.
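The parsing step can be sketched in a few lines of standard-library Python (`from_urls_file_sketch` is a hypothetical stand-in, assuming one URL per line with blank lines ignored):

```python
import os
import tempfile

def from_urls_file_sketch(url_file):
    """Read one URL per line, ignoring blank lines (illustration only)."""
    with open(url_file) as f:
        return [line.strip() for line in f if line.strip()]

# Build a small urls file to demonstrate
with tempfile.NamedTemporaryFile('w', delete=False) as f:
    f.write('https://example.com/a.nc\n\nhttps://example.com/b.nc\n')
    path = f.name

urls = from_urls_file_sketch(path)
os.remove(path)
print(urls)  # ['https://example.com/a.nc', 'https://example.com/b.nc']
```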
3.2 from_sentinel_meta4

Parse URLs from the products.meta4 file downloaded from https://scihub.copernicus.eu/dhus.

```python
parse_urls.from_sentinel_meta4(url_file)
```
Parameters:

url_file: str
    path to products.meta4

Returns:

List of URLs.
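A .meta4 file is a Metalink 4 XML document, so the URL extraction can be sketched with `xml.etree` (a sketch under that assumption; the file name and URL below are made up):

```python
import xml.etree.ElementTree as ET

# A minimal products.meta4-style document (file name and URL are made up)
meta4 = """<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="S1A_example.zip">
    <url>https://scihub.copernicus.eu/dhus/odata/v1/Products('uuid')/$value</url>
  </file>
</metalink>"""

ns = {'ml': 'urn:ietf:params:xml:ns:metalink'}
root = ET.fromstring(meta4)
urls = [u.text for u in root.findall('.//ml:url', ns)]
print(urls)
```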
3.3 from_html

Parse URLs from an HTML page.

```python
parse_urls.from_html(url, suffix=None, suffix_depth=0, url_depth=0)
```
参数:
Returns:

List of URLs.
Example:

```python
from data_downloader import parse_urls

url = 'https://cds-espri.ipsl.upmc.fr/espri/pubipsl/iasib_CH4_2014_uk.jsp'
urls = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1)
urls_all = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1, url_depth=1)
print(len(urls_all) - len(urls))
```
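The suffix filtering can be sketched with the standard `html.parser` (an illustration only: the page URL and HTML snippet are made up, and the real `from_html` additionally supports recursive crawling via `suffix_depth` and `url_depth`):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect <a href> links whose target ends with one of the given suffixes."""

    def __init__(self, base_url, suffixes):
        super().__init__()
        self.base_url = base_url
        self.suffixes = tuple(suffixes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if href.endswith(self.suffixes):
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

html = '<a href="data/iasib_20140101.nc">nc</a> <a href="about.html">about</a>'
parser = LinkParser('https://cds-espri.ipsl.upmc.fr/espri/pubipsl/page.jsp', ['.nc'])
parser.feed(html)
print(parser.links)
```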
3.4 from_EarthExplorer_order

Parse URLs from orders in EarthExplorer.

Reference: bulk-downloader

```python
parse_urls.from_EarthExplorer_order(username=None, passwd=None, email=None, order=None, url_host=None)
```
Parameters:

username, passwd: str, optional
    your username and passwd to log in to EarthExplorer. Can be
    None if you have saved them in .netrc
email: str, optional
    email address for the user that submitted the order
order: str or dict
    which order to download. If None, all orders retrieved from
    EarthExplorer will be used.
url_host: str
    the url host; set it when the host is not USGS ESPA

Returns:

A dict in the format {orderid: url}.
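A sketch of consuming the returned mapping, with a hypothetical result dict (order ids and URLs below are made up; in real use each url would be passed to downloader.download_data):

```python
# Hypothetical result of parse_urls.from_EarthExplorer_order
orders = {
    'espa-someone@example.com-0101012345678': 'https://espa.cr.usgs.gov/orders/example/file1.tar.gz',
    'espa-someone@example.com-0101087654321': 'https://espa.cr.usgs.gov/orders/example/file2.tar.gz',
}

for orderid, url in orders.items():
    # In real use: downloader.download_data(url, folder='D:\\data')
    print(orderid, '->', url)
```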