Scraping Craigslist with BeautifulSoup and getting the first image in each post

Posted 2024-06-07 13:46:43


I'm currently trying to scrape aviation listings from Craigslist. I can get all the information I want except the first image of each post. Here is my link:

https://spokane.craigslist.org/search/avo?hasPic=1

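Thanks to a different post on this site I've been able to get all of the images, but I'm having trouble figuring out how to get only the first one.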

I'm using bs4 and requests for this script. Here is what I have so far, which gets all of the images:

from bs4 import BeautifulSoup as bs
import requests

image_url = 'https://images.craigslist.org/{}_300x300.jpg'
r = requests.get('https://spokane.craigslist.org/search/avo?hasPic=1')
soup = bs(r.content, 'lxml')
ids = [item['data-ids'].replace('1:','') for item in soup.select('.result-image[data-ids]', limit = 10)] 
images = [image_url.format(j) for i in ids for j in i.split(',')]
print(images)

Any help is much appreciated.

Thanks in advance,

英泽尔


3 answers

Here is a clean and straightforward solution:

from pprint import pprint

import requests
from bs4 import BeautifulSoup

base_image_url = 'https://images.craigslist.org/{}_300x300.jpg'
r = requests.get('https://spokane.craigslist.org/search/avo?hasPic=1')
soup = BeautifulSoup(r.content, 'lxml')

results = []
for elem in soup.find_all("a", attrs={"class": "result-image gallery"})[:2]:
    listing_url = elem.get("href")
    image_urls = []
    image_ids = elem.get("data-ids")
    if image_ids:
        image_urls = [base_image_url.format(curr_id[2:]) for curr_id in image_ids.split(",")]
    results.append((listing_url, image_urls))

pprint(results)

Output:

[('https://spokane.craigslist.org/avo/d/spokane-lightspeed-sierra-headset/7090771925.html',
  ['https://images.craigslist.org/00N0N_ci3cbcv5T58_300x300.jpg',
   'https://images.craigslist.org/00q0q_5ax4n1nCwmI_300x300.jpg',
   'https://images.craigslist.org/00202_pAcLlJsaR3_300x300.jpg',
   'https://images.craigslist.org/00z0z_kCZGUL6WZZw_300x300.jpg',
   'https://images.craigslist.org/00G0G_8A2Xg7Wbe7B_300x300.jpg',
   'https://images.craigslist.org/00f0f_cNpP8ZfUXdU_300x300.jpg']),
 ('https://spokane.craigslist.org/avo/d/spokane-window-mounted-air-conditioner/7090361383.html',
  ['https://images.craigslist.org/00101_5dLpBXXdDWJ_300x300.jpg',
   'https://images.craigslist.org/00I0I_lxNKJsQAT7X_300x300.jpg',
   'https://images.craigslist.org/00t0t_3BeBsNO6xH6_300x300.jpg',
   'https://images.craigslist.org/00L0L_aPnbejSiXQp_300x300.jpg'])]
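
The results list above keeps every image for each listing. To keep only the first image per post, as the question asks, one possible adjustment is sketched below. It is a sketch under assumptions: `sample_results` is a shortened copy of the output shown above, and `first_images` is a name introduced here for illustration.

```python
# Shortened copy of the output above: (listing_url, image_urls) pairs.
sample_results = [
    ('https://spokane.craigslist.org/avo/d/spokane-lightspeed-sierra-headset/7090771925.html',
     ['https://images.craigslist.org/00N0N_ci3cbcv5T58_300x300.jpg',
      'https://images.craigslist.org/00q0q_5ax4n1nCwmI_300x300.jpg']),
    ('https://spokane.craigslist.org/avo/d/spokane-window-mounted-air-conditioner/7090361383.html',
     ['https://images.craigslist.org/00101_5dLpBXXdDWJ_300x300.jpg']),
]

# Keep only the first image of each listing; use None if a listing has no images.
first_images = [(url, imgs[0] if imgs else None) for url, imgs in sample_results]

for listing_url, image in first_images:
    print(listing_url, image)
```

In the original loop, the same effect could be had by appending `image_urls[0]` (when the list is non-empty) instead of the whole list.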

Let me know if you have any questions :)

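You need to find all elements with the image gallery class and read their data-ids attribute. Then split that value into a list and take the first element, [0].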

from bs4 import BeautifulSoup as bs
import requests

image_url = 'https://images.craigslist.org/{}_300x300.jpg'
r = requests.get('https://spokane.craigslist.org/search/avo?hasPic=1')
soup = bs(r.content, 'lxml')
ids = [item.get('data-ids').replace('1:','') for item in soup.find_all("a", {"class": "result-image gallery"}, limit=10)]
images = [image_url.format(i.split(',')[0]) for i in ids]
print(images)

Result:

['https://images.craigslist.org/00N0N_ci3cbcv5T58_300x300.jpg','https://images.craigslist.org/00101_5dLpBXXdDWJ_300x300.jpg','https://images.craigslist.org/00n0n_8zVXHONPkTH_300x300.jpg','https://images.craigslist.org/00l0l_jiNMe38avtl_300x300.jpg','https://images.craigslist.org/01212_fULyvfO9Rqz_300x300.jpg','https://images.craigslist.org/00D0D_ibbWWn7uFCu_300x300.jpg','https://images.craigslist.org/00z0z_2ylVbmdVnPr_300x300.jpg','https://images.craigslist.org/00Q0Q_ha0o2IJwj4Q_300x300.jpg','https://images.craigslist.org/01212_5LoZU43xA7r_300x300.jpg','https://images.craigslist.org/00U0U_7CMAu8vAhDi_300x300.jpg']

from bs4 import BeautifulSoup
import requests

r = requests.get("https://spokane.craigslist.org/search/avo?hasPic=1")
soup = BeautifulSoup(r.text, 'html.parser')

img = "https://images.craigslist.org/"


imgs = [f"{img}{item.get('data-ids').split(':')[1].split(',')[0]}_300x300.jpg"
        for item in soup.find_all("a", class_="result-image gallery")]

print(imgs)

Output:

['https://images.craigslist.org/00N0N_ci3cbcv5T58_300x300.jpg', 'https://images.craigslist.org/00101_5dLpBXXdDWJ_300x300.jpg', 'https://images.craigslist.org/00n0n_8zVXHONPkTH_300x300.jpg', 'https://images.craigslist.org/00l0l_jiNMe38avtl_300x300.jpg', 'https://images.craigslist.org/00q0q_l4hts9RPOuk_300x300.jpg', 'https://images.craigslist.org/00D0D_ibbWWn7uFCu_300x300.jpg', 'https://images.craigslist.org/00z0z_2ylVbmdVnPr_300x300.jpg', 'https://images.craigslist.org/00Q0Q_ha0o2IJwj4Q_300x300.jpg', 'https://images.craigslist.org/01212_5LoZU43xA7r_300x300.jpg', 'https://images.craigslist.org/00U0U_7CMAu8vAhDi_300x300.jpg', 'https://images.craigslist.org/00m0m_8c7azYhDR1Z_300x300.jpg', 'https://images.craigslist.org/00E0E_7k7cPL7zNnP_300x300.jpg', 'https://images.craigslist.org/00I0I_97AZy8UMt5V_300x300.jpg', 'https://images.craigslist.org/00G0G_iWw8AI8N8Kf_300x300.jpg', 'https://images.craigslist.org/00m0m_9BEEcvD0681_300x300.jpg', 'https://images.craigslist.org/01717_4Ut5FSIdoi3_300x300.jpg', 'https://images.craigslist.org/00h0h_jeAhtDXW2ST_300x300.jpg', 'https://images.craigslist.org/00T0T_hTogH4m9zTH_300x300.jpg', 'https://images.craigslist.org/01212_9x1EFI1CYHE_300x300.jpg', 'https://images.craigslist.org/00H0H_kiXLOtVgReA_300x300.jpg', 'https://images.craigslist.org/00P0P_ad77Eqvf1ul_300x300.jpg', 'https://images.craigslist.org/00909_jyBoTCNGmAJ_300x300.jpg', 'https://images.craigslist.org/00g0g_gFtJlANhi51_300x300.jpg', 'https://images.craigslist.org/00202_3LV7YERBssE_300x300.jpg', 'https://images.craigslist.org/00j0j_3zxT682nE2i_300x300.jpg', 'https://images.craigslist.org/00Y0Y_b6AXcApcSfl_300x300.jpg', 'https://images.craigslist.org/00M0M_6eTHo5E3Ee5_300x300.jpg', 'https://images.craigslist.org/00g0g_hvyvJKUejXY_300x300.jpg', 'https://images.craigslist.org/00I0I_d2WOWXtgQ8s_300x300.jpg', 'https://images.craigslist.org/00s0s_dAwJG0D6uce_300x300.jpg', 'https://images.craigslist.org/00g0g_TC2qvnD3AN_300x300.jpg', 
'https://images.craigslist.org/00M0M_Dba39RfEkr_300x300.jpg', 'https://images.craigslist.org/00M0M_31drxF6c9vO_300x300.jpg', 'https://images.craigslist.org/00505_jOjMq3B8y0M_300x300.jpg', 'https://images.craigslist.org/00e0e_ixfV647qwLh_300x300.jpg', 'https://images.craigslist.org/00p0p_i2noTC4cADw_300x300.jpg', 'https://images.craigslist.org/00a0a_kywatxfm6Ud_300x300.jpg', 'https://images.craigslist.org/00808_1ZjIIX8PdaP_300x300.jpg', 'https://images.craigslist.org/01515_blEEDKbbyKD_300x300.jpg', 'https://images.craigslist.org/00b0b_brUn6sUxBzF_300x300.jpg', 'https://images.craigslist.org/00U0U_2ukBvcgvU99_300x300.jpg', 'https://images.craigslist.org/01212_dPTe5ZHM26A_300x300.jpg', 'https://images.craigslist.org/00B0B_1GsE81zVsr0_300x300.jpg', 'https://images.craigslist.org/00N0N_l8SXlBaI8lq_300x300.jpg', 'https://images.craigslist.org/00f0f_82qAzPq7cXd_300x300.jpg', 'https://images.craigslist.org/00w0w_lUrgFG9YOY0_300x300.jpg', 'https://images.craigslist.org/00C0C_kiZpgrFEnO8_300x300.jpg', 'https://images.craigslist.org/00T0T_g7IHvHMx14L_300x300.jpg', 'https://images.craigslist.org/00E0E_bzm9jRXpWVd_300x300.jpg', 'https://images.craigslist.org/00k0k_lOCRF1fgWCF_300x300.jpg', 'https://images.craigslist.org/00y0y_exwReppAi3L_300x300.jpg', 'https://images.craigslist.org/01515_7xyZ605hYcc_300x300.jpg', 'https://images.craigslist.org/00J0J_hqLMLvTCfXk_300x300.jpg', 'https://images.craigslist.org/00505_3P0xQrbeFY4_300x300.jpg', 'https://images.craigslist.org/00r0r_gj6dO6ZHO8L_300x300.jpg', 'https://images.craigslist.org/01717_cIVmzgKCWtP_300x300.jpg', 'https://images.craigslist.org/00w0w_6O59k6qlZQz_300x300.jpg', 'https://images.craigslist.org/00808_jd43ZthN1uB_300x300.jpg', 'https://images.craigslist.org/00m0m_1GJ41cKvv4Y_300x300.jpg']

The list contains the first image of each post.
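
All three answers rely on the same string handling of the `data-ids` attribute. The sketch below isolates that step so it can be checked without a network request. It assumes the attribute has the form `1:ID,1:ID,...`, as implied by the code above. The function name `first_image_url` and the sample string are hypothetical, and a missing or empty attribute returns `None`.

```python
from typing import Optional

IMAGE_URL = 'https://images.craigslist.org/{}_300x300.jpg'


def first_image_url(data_ids: Optional[str]) -> Optional[str]:
    """Return the URL of the first image in a data-ids string, or None."""
    if not data_ids:
        return None
    first = data_ids.split(',')[0]      # e.g. '1:00N0N_ci3cbcv5T58'
    image_id = first.split(':', 1)[-1]  # drop the '1:' style prefix if present
    return IMAGE_URL.format(image_id)


# Sample value assumed for illustration; the IDs are taken from the output above.
print(first_image_url('1:00N0N_ci3cbcv5T58,1:00q0q_5ax4n1nCwmI'))
print(first_image_url(None))
```

With BeautifulSoup it could be applied as `first_image_url(item.get('data-ids'))` for each matched anchor, which avoids an AttributeError when the attribute is absent.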
