The script below works great for me. Basically it finds all the data files I'm interested in on a given website, checks whether they're already on my computer (and skips them if they are), and finally downloads them with cURL.
The problem is that sometimes there are 400+ very large files and I can't download them all at once. I'll press Ctrl-C, but it seems to cancel the current cURL download rather than the script itself, so I end up having to cancel the downloads one by one. Is there a way around this? Perhaps some key command I can use to stop execution once the current download finishes?
#!/usr/bin/python
import os
import urllib2
import re

savedir = "/Users/someguy/Documents/Research/VLF_Hissler/Data/"

# connect to the URL and read its HTML
website = urllib2.urlopen("http://somewebsite")
html = website.read()

# use re.findall to collect the names of all the data files
filenames = re.findall(r'SP.*?\.mat', html)

# Check which files are already downloaded and remove them from the
# download queue.
count = 0
countpass = 0
for files in os.listdir(savedir):
    if files.endswith(".mat"):
        try:
            filenames.remove(files)
            count += 1
        except ValueError:
            countpass += 1
print "counted number of removes", count
print "counted number of failed removes", countpass
print "number of files after removal:", len(filenames)

# turn the file names into an array of full download links
links = len(filenames) * [0]
for j in range(len(filenames)):
    links[j] = 'http://somewebsite.edu/public_web_junk/southpole/2014/' + filenames[j]

# download each file (note: curl saves into the current working
# directory here, not into savedir)
for i in range(len(links)):
    os.system("curl -o " + filenames[i] + " " + links[i])
print "links downloaded:", len(links)
You can also use curl to check the file size before downloading:
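A minimal sketch of that check (the remote_size helper and the use of curl -sI to send a HEAD request are assumptions for illustration, not from the original post):

import subprocess

def remote_size(url):
    # hypothetical helper: "curl -sI" asks the server for the response
    # headers only; Content-Length carries the file size in bytes
    headers = subprocess.check_output(["curl", "-sI", url])
    for line in headers.splitlines():
        if line.lower().startswith("content-length:"):
            return int(line.split(":", 1)[1].strip())
    return None  # the server did not report a size

Comparing remote_size(links[j]) against os.path.getsize(savedir + filenames[j]) would also let the script detect partially downloaded files, which the name-based check in the script cannot.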