Difficulty extracting high-quality images from Wikimedia with a get_main_image function

1 vote
1 answer
40 views
Asked 2025-04-14 18:09

I'm having trouble with a Python script I wrote to scrape images from Wikimedia. The problem is in a function called get_main_image: it keeps downloading relatively small images rather than the highest-quality versions.

Here is a quick summary of the problem:

  • The get_main_image function is supposed to fetch and save an image from Wikimedia.
  • However, it always seems to download a smaller or lower-quality version.
  • My goal is to modify the function so that it fetches the largest, sharpest version of the image available on Wikimedia.

I suspect the function either goes wrong when identifying and fetching the image link, or picks the wrong image quality.

Here is a simplified version of the get_main_image function:

import requests
from io import BytesIO

from bs4 import BeautifulSoup as bs
from PIL import Image

# access_token is defined elsewhere in the script.

def get_main_image(wiki_link, article, save_dir, IMAGE_NUM):
    headers = {
        "Authorization": f"Bearer {access_token}",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    }
    image_url = wiki_link + article.replace(" ", "_")
    image_name = str(IMAGE_NUM + 1)
    response = requests.get(image_url)
    soup = bs(response.text, "html.parser")

    try:
        # First attempt: use the srcset attribute of the main image.
        main_image_url = soup.find("img", alt=article).get("srcset")
        main_image_response = requests.get(url=main_image_url, headers=headers, stream=True)
    except Exception:
        try:
            # Fallback: use the src attribute (a thumbnail-sized URL).
            main_image_url = soup.find("img", alt=article).get("src")
            main_image_response = requests.get(url=main_image_url, headers=headers, stream=True)
        except Exception:
            return image_url, None

    if article[-4:] == ".svg":
        # SVG files are rasterised and saved as PNG.
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + ".png"
        save_path = save_dir + "//" + image_name
        image.save(save_path)
    elif article[-5:] == ".djvu":
        # DjVu files are saved as JPEG.
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + ".jpg"
        save_path = save_dir + "//" + image_name
        image.save(save_path)
    else:
        # Everything else keeps its original extension.
        image = Image.open(BytesIO(main_image_response.content))
        image_name = image_name + article[-4:]
        save_path = save_dir + "//" + image_name
        try:
            image.save(save_path)
        except Exception:
            image_name = None
    return image_url, image_name
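One likely culprit: srcset is not a single URL but a comma-separated list of "URL descriptor" candidates (e.g. 1x, 1.5x, 2x), so passing the raw attribute value straight to requests.get fails, and the code silently falls back to the small src thumbnail. A minimal sketch of picking the largest candidate from a srcset string (the URLs below are made-up placeholders):

```python
def largest_srcset_url(srcset):
    """Return the URL with the highest density descriptor in a srcset string."""
    candidates = []
    for entry in srcset.split(","):
        parts = entry.strip().split()
        url = parts[0]
        # A candidate without a descriptor defaults to density 1x.
        density = float(parts[1].rstrip("x")) if len(parts) > 1 else 1.0
        candidates.append((density, url))
    # max() compares by density first, since it is the tuple's first element.
    return max(candidates)[1]

# Placeholder srcset in the shape Wikimedia thumbnails use:
srcset = ("//upload.example/a_250px.png 1x, "
          "//upload.example/a_375px.png 1.5x, "
          "//upload.example/a_500px.png 2x")
print(largest_srcset_url(srcset))  # → //upload.example/a_500px.png
```

Note that even the largest srcset candidate is still a thumbnail; to get the true original you need the file-page link or an API (see the answer below).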

Edit: For example, this image (the main image at the link below) downloads at 1.09 MB, while it is actually 4.33 MB. https://commons.wikimedia.org/wiki/File:Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png

1 Answer

1

If I understand correctly, you can use the Wikimedia Commons API to get a link to the large image, e.g.:

import requests
from bs4 import BeautifulSoup

api_url = "https://magnus-toolserver.toolforge.org/commonsapi.php"
image_name = "Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png"

soup = BeautifulSoup(requests.get(api_url, params={"image": image_name}).content, "xml")
# print(soup.prettify())

# The <file> element inside <urls> holds the original upload URL.
print(soup.urls.file.text)

This code prints:

https://upload.wikimedia.org/wikipedia/commons/7/7e/Map_of_Potential_Nuclear_Strike_Targets_%28c._2015%29%2C_FEMA.png
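Alternatively, the official MediaWiki API on Commons (action=query with prop=imageinfo and iiprop=url) reports the original-resolution upload URL directly, without going through the toolforge service. A sketch under that assumption (the helper names are my own):

```python
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def build_imageinfo_params(filename):
    """Query parameters asking the MediaWiki API for a file's original URL."""
    return {
        "action": "query",
        "titles": "File:" + filename,
        "prop": "imageinfo",
        "iiprop": "url|size",
        "format": "json",
    }

def original_file_url(filename):
    """Return the full-resolution URL of a file hosted on Wikimedia Commons."""
    data = requests.get(COMMONS_API,
                        params=build_imageinfo_params(filename),
                        timeout=30).json()
    # The API keys pages by page ID; there is exactly one page here.
    page = next(iter(data["query"]["pages"].values()))
    return page["imageinfo"][0]["url"]

# original_file_url("Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png")
# should resolve to the upload.wikimedia.org original shown above.
```

Downloading that URL with requests.get(..., stream=True) then gives the full-size file rather than a thumbnail.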
