Regular expression or another way to get only image URLs

Posted 2024-03-28 14:18:59


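I want to download the images from this page: http://wordpandit.com/learning-bin/visual-vocabulary/page/2/ I fetched it with urllib and parsed it with BeautifulSoup. The page contains many URLs; I only want the ones that end in .jpg and that also carry the rel="prettyPhoto[gallery]" attribute. How can I do this with BeautifulSoup? Example link: http://wordpandit.com/wp-content/uploads/2013/02/Obliterate.jpg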

#http://wordpandit.com/learning-bin/visual-vocabulary/page/2/
import urllib
import BeautifulSoup
import lxml
baseurl='http://wordpandit.com/learning-bin/visual-vocabulary/page/'
count=2


for count in range(1,2):
    url=baseurl+count+'/'
    soup1=BeautifulSoup.BeautifulSoup(urllib2.urlopen(url))#read will not be needed
    #find all links to imgs
    atag=soup.findAll(rel="prettyPhoto[gallery]")
    for tag in atag:
        soup2=BeautifulSoup.BeautifulSoup(tag)
        imgurl=soup2.find(href).value
        urllib2.urlopen(imgurl)

1 answer
User
#1 · Posted 2024-03-28 14:18:59

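Your code contains a lot of unnecessary parts. Maybe you plan to use them later, but things like assigning count = 2 and then reusing count as the loop variable of a for/range loop serve no purpose. Here is code that does what you want: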

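# Python 2 code: urllib2 and urlparse are Python 2 modules.
import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

baseurl = 'http://wordpandit.com/learning-bin/visual-vocabulary/page/'

# range(1, 2) visits only page 1; widen the range to fetch more pages.
for count in range(1, 2):
    url = baseurl + str(count) + "/"
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    # Keep only tags that have rel="prettyPhoto[gallery]" and an href attribute.
    atag = soup.findAll(rel="prettyPhoto[gallery]", href=True)
    for tag in atag:
        if tag['href'].endswith(".jpg"):
            # urljoin leaves absolute hrefs unchanged and resolves relative ones,
            # so the example link from the question is not prefixed twice.
            imgurl = urljoin(url, tag['href'])
            img = urllib2.urlopen(imgurl)
            # Save the image under the file name at the end of its URL.
            with open(imgurl.split("/")[-1], "wb") as local_file:
                local_file.write(img.read())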
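
The filtering step can be tried offline before any download is attempted. The following is only a sketch under stated assumptions: it is written for Python 3, it uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages, and the names GalleryLinkParser, extract_gallery_images and SAMPLE_HTML (including the Example.jpg link) are made up for illustration. It keeps links whose rel contains prettyPhoto[gallery] and whose href ends in .jpg, and it resolves each href against the page URL with urljoin, so an absolute link like the one in the question stays intact.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class GalleryLinkParser(HTMLParser):
    """Collect .jpg links from tags whose rel attribute contains rel_value.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """

    def __init__(self, page_url, rel_value="prettyPhoto[gallery]"):
        super().__init__()
        self.page_url = page_url
        self.rel_value = rel_value
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        href = attr_map.get("href")
        # rel may hold several space-separated tokens, so split before matching.
        rel_tokens = (attr_map.get("rel") or "").split()
        if not href or self.rel_value not in rel_tokens:
            return
        if href.lower().endswith(".jpg"):
            # urljoin keeps absolute URLs as they are and resolves relative ones.
            self.image_urls.append(urljoin(self.page_url, href))


def extract_gallery_images(html, page_url):
    """Return the absolute .jpg URLs of gallery links found in html."""
    parser = GalleryLinkParser(page_url)
    parser.feed(html)
    return parser.image_urls


# Made-up markup standing in for the real page; only the first link
# is taken from the question, the others are assumptions.
SAMPLE_HTML = """
<a rel="prettyPhoto[gallery]" href="http://wordpandit.com/wp-content/uploads/2013/02/Obliterate.jpg">a</a>
<a rel="prettyPhoto[gallery]" href="/wp-content/uploads/2013/02/Example.jpg">b</a>
<a rel="prettyPhoto[gallery]" href="/about/">c</a>
<a href="/wp-content/uploads/2013/02/NoRel.jpg">d</a>
"""

PAGE_URL = "http://wordpandit.com/learning-bin/visual-vocabulary/page/2/"
image_urls = extract_gallery_images(SAMPLE_HTML, PAGE_URL)
for image_url in image_urls:
    print(image_url)
```

With the sample markup, only the first two links pass both checks: the third has the right rel but is not a .jpg, and the fourth is a .jpg without the rel attribute. Each URL in image_urls could then be fetched with urllib.request.urlopen and written to disk in the same way as in the answer above.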
