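Downloading a file to a specified path with Selenium WebDriver
I need to download a file to a given location on a non-local machine. This is my normal flow in a web browser:
- Go to the website
- Click the button that downloads the file (the button is a form that generates the file, not a direct download link)
- The website pops up a prompt: "Do you want to download this file?", and so on.
I'd like to bypass that download prompt and do something like: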
>>> path_to_download_path = PATH
>>> button = driver.find_element_by_css_selector("...")
>>> button.click()
--> And the file is automatically downloaded to my PATH (or wherever I choose)
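Or is there something simpler I can do with click that would automatically download the file's contents?
How would I do this?
3 Answers
When you initialize your driver, be sure to set the download preferences.
For Firefox: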
ff_prof.set_preference( "browser.download.manager.showWhenStarting", False )
ff_prof.set_preference( "browser.download.folderList", 2 )
ff_prof.set_preference( "browser.download.useDownloadDir", True )
ff_prof.set_preference( "browser.download.dir", self.driver_settings['download_folder'] )
##
# if FF still shows the download dialog, make sure that the filetype is included below
# filetype string options can be found in '~/.mozilla/$USER_PROFILE/mimeTypes.rdf'
##
mime_types = ("application/pdf", "text/html")
ff_prof.set_preference( "browser.helperApps.neverAsk.saveToDisk", (", ".join( mime_types )) )
ff_prof.set_preference( "browser.helperApps.neverAsk.openFile", (", ".join( mime_types )) )
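For context, here is a minimal sketch of how the preferences above might be applied to a freshly created driver. It assumes a recent Selenium release where preferences are set through `webdriver.FirefoxOptions` rather than a profile object; the helper names `firefox_download_prefs` and `make_firefox_driver` are hypothetical and not part of the answer's code.

```python
def firefox_download_prefs(download_dir, mime_types):
    """Build the Firefox preferences that send downloads straight to download_dir."""
    return {
        "browser.download.folderList": 2,            # 2 = use browser.download.dir
        "browser.download.useDownloadDir": True,
        "browser.download.dir": download_dir,
        "browser.download.manager.showWhenStarting": False,
        # MIME types that should be saved without asking
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
    }


def make_firefox_driver(download_dir, mime_types):
    """Start Firefox with the preferences above (needs Selenium and geckodriver)."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    for name, value in firefox_download_prefs(download_dir, mime_types).items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

Calling `make_firefox_driver("/tmp/driver_downloads", ["application/pdf"])` would then return a driver whose downloads of that MIME type should land in the given folder without a dialog, provided the server reports the same MIME type.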
For Chrome:
capabilities['chromeOptions']['prefs']['download.prompt_for_download'] = False
capabilities['chromeOptions']['prefs']['download.default_directory'] = self.driver_settings['download_folder']
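The same two Chrome preferences can also be passed through `webdriver.ChromeOptions`, which is how recent Selenium releases usually expect them. This is a sketch under that assumption; `chrome_download_prefs` and `make_chrome_driver` are hypothetical helper names.

```python
def chrome_download_prefs(download_dir):
    """Build the Chrome preferences that disable the prompt and set the target folder."""
    return {
        "download.prompt_for_download": False,
        "download.default_directory": download_dir,
    }


def make_chrome_driver(download_dir):
    """Start Chrome with the preferences above (needs Selenium and chromedriver)."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", chrome_download_prefs(download_dir))
    return webdriver.Chrome(options=options)
```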
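Forwarding the download:
Below is the code I use to move the file from self.driver_settings['download_folder'] (set above) to wherever you actually want it (to_path can be an existing folder or a file path). If you're on Linux, I'd suggest using tmpfs so that /tmp is held in RAM, and then setting self.driver_settings['download_folder'] to "/tmp/driver_downloads/". Note that the function below assumes self.driver_settings['download_folder'] always starts out as an empty folder (that is how it locates the file being downloaded, since it is the only file in the directory).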
import os
import platform
import re
import shutil
import subprocess
import time
from datetime import datetime, timedelta

class DownloadTimeoutError(Exception):
    pass

# method of the class that owns self.driver_settings
def moveDriverDownload(self, to_path, allowable_extensions, allow_rename_if_exists=False, timeout_seconds=None):
    if timeout_seconds is None:
        timeout_seconds = 30
    wait_delta = timedelta( seconds=timeout_seconds )
    start_download_time = datetime.now()
    hasTimedOut = lambda: datetime.now() - start_download_time > wait_delta
    assert isinstance(allowable_extensions, list) or isinstance(allowable_extensions, tuple) or isinstance(allowable_extensions, set), "instead of a list, found allowable_extensions type of '{}'".format(type(allowable_extensions))
    # normalise the extensions to lower case with a leading dot
    allowable_extensions = [ elem.lower().strip() for elem in allowable_extensions ]
    allowable_extensions = [ elem if elem.startswith(".") else "."+elem for elem in allowable_extensions ]
    # Firefox writes in-progress downloads to '<name>.part'
    if not ".part" in allowable_extensions:
        allowable_extensions.append( ".part" )
    re_extension_str = "(?:" + ("$)|(?:".join( re.escape(elem) for elem in allowable_extensions )) + "$)"
    getFiles = lambda: next( os.walk( self.driver_settings['download_folder'] ) )[2]
    # wait for a file with an allowed extension to appear
    while True:
        if hasTimedOut():
            del allowable_extensions[ allowable_extensions.index(".part") ]
            raise DownloadTimeoutError( "timed out after {} seconds while waiting on file download with extension in {}".format(timeout_seconds, allowable_extensions) )
        time.sleep( 0.5 )
        file_list = [ elem for elem in getFiles() if re.search( re_extension_str, elem ) ]
        if len(file_list) > 0:
            break
    # strip a trailing '.part' so 'x.pdf' and 'x.pdf.part' compare equal
    file_list = [ re.search( r"(?i)^(.*?)(?:\.part)?$", elem ).groups()[0] for elem in file_list ]
    if len(file_list) > 1:
        if len(file_list) == 2:
            if file_list[0] != file_list[1]:
                raise Exception( "file_list[0] != file_list[1] <==> {} != {}".format(file_list[0], file_list[1]) )
        else:
            raise Exception( "len(file_list) > 1. found {}".format(file_list) )
    # assumes download_folder ends with a path separator
    file_path = "%s%s" %(self.driver_settings['download_folder'], file_list[0])
    # see if the file is still being downloaded by checking if it's open by any programs
    # universal_newlines=True makes communicate() return str rather than bytes
    if platform.system() == "Linux":
        openProcess = lambda: subprocess.Popen( 'lsof | grep "%s"' %file_path, shell=True, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True )
        fileIsFinished = lambda txt: txt.strip() == ""
    elif platform.system() == "Windows":
        # 'handle' program must be in PATH
        # https://technet.microsoft.com/en-us/sysinternals/bb896655
        openProcess = lambda: subprocess.Popen( 'handle "%s"' %file_path.replace("/", "\\"), shell=True, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True )
        fileIsFinished = lambda txt: bool( re.search("(?i)No matching handles found", txt) )
    else:
        raise Exception( "unrecognised platform.system() of '{}'".format(platform.system()) )
    while True:
        lsof_process = openProcess()
        lsof_result = lsof_process.communicate()
        if len(lsof_result) != 2:
            raise Exception( "len(lsof_result) != 2. found {}".format(lsof_result) )
        if lsof_result[1].strip() != "":
            raise Exception( 'lsof_result[1].strip() != "". found {}'.format(lsof_result) )
        if fileIsFinished( lsof_result[0] ):
            break
        if hasTimedOut():
            raise Exception( "timed out after {} seconds waiting for '{}' to be freed from writing. found lsof/handle of '{}'".format(timeout_seconds, file_path, lsof_result[0]) )
        time.sleep( 0.5 )
    to_path = to_path.replace("\\", "/")
    if os.path.isdir( to_path ):
        if not to_path.endswith("/"):
            to_path += "/"
        to_path += file_list[0]
    # if the target exists, either fail or append -2, -3, ... before the extension
    i = 2
    while os.path.exists( to_path ):
        if not allow_rename_if_exists:
            raise Exception( "{} already exists".format(to_path) )
        to_path = re.sub( "^(.*/)(.*?)(?:-" + str(i-1) + r")?(|\..*?)?$", r"\1\2-%i\3" %i, to_path )
        i += 1
    shutil.move( file_path, to_path )
    return to_path[ to_path.rindex("/")+1: ]
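If shelling out to lsof or handle is not an option, a lighter variant of the same idea is to poll the download folder until no partial-download files remain. The sketch below is not the function above: it assumes the folder starts out empty, that in-progress files end in `.part` (Firefox) or `.crdownload` (Chrome), and it does not check whether another process still has the file open. The name `wait_and_move_download` is hypothetical.

```python
import os
import shutil
import time

# Suffixes assumed to mark in-progress downloads (Firefox and Chrome respectively).
PARTIAL_SUFFIXES = (".part", ".crdownload")


def wait_and_move_download(download_dir, to_dir, timeout_seconds=30, poll_seconds=0.5):
    """Wait until download_dir holds exactly one finished file, then move it to to_dir."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        names = os.listdir(download_dir)
        finished = [n for n in names if not n.endswith(PARTIAL_SUFFIXES)]
        in_progress = [n for n in names if n.endswith(PARTIAL_SUFFIXES)]
        # Done once a single complete file is present and nothing is still partial.
        if len(finished) == 1 and not in_progress:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(
                "no finished download in {!r} after {} seconds".format(download_dir, timeout_seconds)
            )
        time.sleep(poll_seconds)

    destination = os.path.join(to_dir, finished[0])
    if os.path.exists(destination):
        raise FileExistsError(destination)
    shutil.move(os.path.join(download_dir, finished[0]), destination)
    return destination
```

It would typically be called right after `button.click()`, for example `wait_and_move_download("/tmp/driver_downloads/", "/data/reports")`, where both paths are placeholders.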
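Using Selenium WebDriver
Use a Firefox profile to download the file. The profile lets you skip Firefox's download dialog.
In the line:
pro.setPreference("browser.download.folderList", 0);
the browser.download.folderList preference can be set to 0, 1, or 2. With 0, Firefox saves all downloaded files to the user's desktop. With 1, downloads go to the Downloads folder. With 2, the location specified for the most recent download is used again; this is also the value to use when setting browser.download.dir to a custom folder.
The Firefox profile code you need is: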
FirefoxProfile pro=new FirefoxProfile();
pro.setPreference("browser.download.folderList", 0);
pro.setPreference("browser.helperApps.neverAsk.saveToDisk", "application/zip");
WebDriver driver=new FirefoxDriver(pro);
driver.get("http://selenium-release.storage.googleapis.com/2.47/selenium-java-2.47.1.zip");
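Hope this helps :)
Before you can modify the site's JavaScript, you first have to understand how it works. Even once you have figured that out, the browser's security settings will still pop up a dialog asking you to confirm the download. So you are left with roughly two options:
- Confirm the alert dialog that pops up
- Work out where the file lives on the remote server and download it with a GET request
I can't help with the details of either approach, since I don't know Python, but hopefully this information is useful...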
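For the second option, a minimal Python sketch might look like the following. It assumes the file really is reachable with a plain GET (which may not hold when a form generates it) and that the server only needs the browser's session cookies, which Selenium exposes through `driver.get_cookies()`. The helper names `cookie_header` and `download_with_session` are hypothetical.

```python
import shutil
import urllib.request


def cookie_header(selenium_cookies):
    """Turn the list of dicts returned by driver.get_cookies() into a Cookie header value."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in selenium_cookies)


def download_with_session(url, selenium_cookies, to_path):
    """GET url with the browser's cookies and stream the response body to to_path."""
    request = urllib.request.Request(url, headers={"Cookie": cookie_header(selenium_cookies)})
    with urllib.request.urlopen(request) as response, open(to_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return to_path
```

Usage would be along the lines of `download_with_session(file_url, driver.get_cookies(), PATH)`, where `file_url` is whatever address the file turns out to have.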