Difficulty fetching high-quality images from Wikimedia with a get_main_image function
I'm running into a problem with a Python script that scrapes images from Wikimedia. The issue is in a function called get_main_image: it keeps downloading relatively small images instead of the highest-quality versions.
Here is a quick summary of the problem:
- The get_main_image function is supposed to fetch an image from Wikimedia and save it.
- However, it always seems to download a smaller or lower-quality version.
- My goal is to modify the function so that it fetches the largest, sharpest version of the image available on Wikimedia.
I suspect the function is going wrong either in how it identifies and extracts the image link, or in how it selects the image quality.
Here is a simplified version of the get_main_image function:
import requests
from bs4 import BeautifulSoup as bs
from io import BytesIO
from PIL import Image

def get_main_image(wiki_link, article, save_dir, IMAGE_NUM):
    headers = {
        # access_token is defined elsewhere in the script
        "Authorization": f"Bearer {access_token}",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    }
    image_url = wiki_link + article.replace(" ", "_")
    image_name = str(IMAGE_NUM + 1)
    response = requests.get(image_url)
    soup = bs(response.text, 'html.parser')
    try:
        main_image_url = soup.find('img', alt=article).get('srcset')
        main_image_response = requests.get(url=main_image_url, headers=headers, stream=True)
    except Exception:
        try:
            main_image_url = soup.find('img', alt=article).get('src')
            main_image_response = requests.get(url=main_image_url, headers=headers, stream=True)
        except Exception:
            return image_url, None
    if article[-4:] == ".svg":
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + ".png"
        save_path = save_dir + "/" + image_name
        image.save(save_path)
    elif article[-5:] == ".djvu":
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + ".jpg"
        save_path = save_dir + "/" + image_name
        image.save(save_path)
    else:
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + article[-4:]
        save_path = save_dir + "/" + image_name
        try:
            image.save(save_path)
        except Exception:
            image_name = None
    return image_url, image_name
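One likely cause, assuming the page markup is what the code expects: `srcset` is not a single URL but a comma-separated list of candidates, each URL optionally followed by a density or width descriptor (e.g. `//…/220px-Foo.png 1.5x, //…/440px-Foo.png 2x`). Passing the whole attribute to `requests.get` therefore fails, and the code silently falls back to the small `src` thumbnail. A sketch (helper name is mine) of picking the largest candidate from a `srcset` string:

```python
def largest_srcset_candidate(srcset):
    """Return the URL with the highest descriptor from a srcset string.

    Each comma-separated candidate is a URL, optionally followed by a
    descriptor such as "2x" (density) or "440w" (width).
    """
    best_url, best_score = None, -1.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        url = parts[0]
        score = 1.0  # a bare URL with no descriptor counts as 1x
        if len(parts) > 1:
            score = float(parts[1].rstrip("xw"))
        if score > best_score:
            best_url, best_score = url, score
    return best_url

srcset = "//upload.example/220px-Foo.png 1.5x, //upload.example/440px-Foo.png 2x"
print(largest_srcset_candidate(srcset))  # //upload.example/440px-Foo.png
```

Two caveats: Wikimedia's `srcset` URLs are protocol-relative (they start with `//`), so you'd need to prepend `https:` before requesting them, and even the largest `srcset` entry is still a resized thumbnail rather than the original upload.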
Edit: As an example, the main image at this link downloads as 1.09 MB, while the actual file is 4.33 MB: https://commons.wikimedia.org/wiki/File:Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png
1 Answer
If I understand you correctly, you can use the Wikimedia Commons API to get a link to the full-size image, for example:
import requests
from bs4 import BeautifulSoup
api_url = "https://magnus-toolserver.toolforge.org/commonsapi.php"
image_name = "Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png"
soup = BeautifulSoup(requests.get(api_url, params={"image": image_name}).content, "xml")
# print(soup.prettify())
print(soup.urls.file.text)
This code prints:
https://upload.wikimedia.org/wikipedia/commons/7/7e/Map_of_Potential_Nuclear_Strike_Targets_%28c._2015%29%2C_FEMA.png
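An alternative, if you would rather query Wikimedia's own servers than the Toolforge service: the standard MediaWiki API on Commons exposes the original file URL via `action=query` with `prop=imageinfo&iiprop=url`. A minimal sketch (the helper names are mine):

```python
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def imageinfo_params(file_name):
    """Build the query parameters for an imageinfo request."""
    return {
        "action": "query",
        "titles": f"File:{file_name}",
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }

def original_file_url(file_name):
    """Return the URL of the original, full-resolution file on Commons."""
    resp = requests.get(
        COMMONS_API,
        params=imageinfo_params(file_name),
        headers={"User-Agent": "image-fetcher-example/0.1"},
    )
    pages = resp.json()["query"]["pages"]
    page = next(iter(pages.values()))  # the response is keyed by internal page id
    return page["imageinfo"][0]["url"]
```

Calling `original_file_url("Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png")` should yield the same full-size `upload.wikimedia.org` URL as the answer above, since `imageinfo`'s `url` field points at the original upload rather than a thumbnail.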