I am trying to download all the PDF files linked from the following URLs:
https://www.adb.org/projects/documents/country/ban/year/2020?terms=education
https://www.adb.org/projects/documents/country/ban/year/2019?terms=education
https://www.adb.org/projects/documents/country/ban/year/2018?terms=education
Each of these URLs contains a list of links to sub-pages that hold the PDF files. The list of links on each main URL comes from a search result by country, year, and term.
I have tried modifying the code below in several ways, but it doesn't seem to work. Any help would be appreciated. Thanks.
import os
import time
from glob import glob
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = ["https://www.adb.org/projects/documents/country/ban/year/2020?terms=education",
       "https://www.adb.org/projects/documents/country/ban/year/2019?terms=education",
       "https://www.adb.org/projects/documents/country/ban/year/2018?terms=education"]
folder = glob("J:/pdfs/*/")

for i, folder_location in zip(url, folder):
    time.sleep(1)
    response = requests.get(i)
    soup = BeautifulSoup(response.text, "lxml")
    for link in soup.select("[href$='.pdf']"):
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(i, link['href'])).content)
Try this. It will put the files in a pdfs folder, with the PDFs from each URL downloaded into their own separate subfolder.
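A minimal sketch of that approach. The helper names `folder_for` and `download_pdfs` are my own, not from the original; the per-URL subfolder is derived from the year segment of the URL path rather than relying on pre-existing folders matched by `glob`, which is one likely reason the original `zip(url, folder)` pairing failed (if `J:/pdfs/` has no subfolders, or they sort in a different order than the URL list, URLs and folders are mismatched or skipped entirely). Note also that the `[href$='.pdf']` selector only finds links whose `href` ends in `.pdf` directly on the listing page; since the question says the PDFs sit on sub-pages, a second level of crawling over each result link may be needed.

```python
import os
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

URLS = [
    "https://www.adb.org/projects/documents/country/ban/year/2020?terms=education",
    "https://www.adb.org/projects/documents/country/ban/year/2019?terms=education",
    "https://www.adb.org/projects/documents/country/ban/year/2018?terms=education",
]


def folder_for(url, root="pdfs"):
    # Derive a per-URL subfolder from the year segment of the path,
    # e.g. .../year/2020?terms=education -> pdfs/2020
    year = urlparse(url).path.rstrip("/").split("/")[-1]
    return os.path.join(root, year)


def download_pdfs(url, folder):
    # Create the target folder if it does not exist yet
    os.makedirs(folder, exist_ok=True)
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    # Select only anchors whose href ends in .pdf
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link["href"])
        filename = os.path.join(folder, pdf_url.split("/")[-1])
        with open(filename, "wb") as f:
            f.write(requests.get(pdf_url).content)
        time.sleep(1)  # be polite to the server between downloads


# To run the crawl:
# for url in URLS:
#     download_pdfs(url, folder_for(url))
```

With this layout the PDFs from each year's search results land in their own subfolder (`pdfs/2018`, `pdfs/2019`, `pdfs/2020`), created on demand instead of looked up with `glob`.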