BeautifulSoup not getting image src as expected

0 votes
3 answers
1276 views
Asked 2025-04-18 01:05

I am trying to use BeautifulSoup to extract image links from Bing's image search results.

At first, everything works as expected:

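from bs4 import BeautifulSoup
import re
import requests

def get_soup(url):
    # No parser is named here, so BeautifulSoup picks whichever one is installed.
    return BeautifulSoup(requests.get(url).text)

query = 'doggy'
# Parentheses let the concatenation span two lines.
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)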

However, the following code returns an empty list instead of the list of links I want:

bimg = re.compile("mm.bing.net")
img_links = soup.find_all("img", {"src": bimg})
print(img_links)

When I print soup.prettify(), I can see the links I want. It looks as though all of the image tags may be inside a script. Could that be why BeautifulSoup does not see them?

Here is some of the prettified output that contains the links:

<script type="text/javascript">
  //<![CDATA[
var t = '<div class="iol_fp" id="iol_bg"></div><div id="iol_ph"></div><div id="iol_dp"><button id="iol_cls" title="Close"></button><div id="iol_ip"><div id="iol_imp">
<div id="iol_imw"></div><div class="iol_nav" id="iol_navl"></div><div class="iol_nav" id="iol_navr"></div></div><div id="iol_mdb"><span class="iol_mdi" id="iol_md"><span id="iol_mdis"></span><span id="iol_sep">·</span><a id="iol_mdit"></a></span>
<span id="iol_bspan"><button class="iol_mdi" id="iol_pin" href="#" title="Pin to Pinterest"></button><button class="iol_mdi" id="iol_vl" href="#">Show larger</button><button class="iol_mdi" id="iol_vs" href="#">Show smaller</button>
<button class="iol_mdi" id="iol_ss" href="#">Play All</button><button class="iol_mdi" id="iol_sse" href="#">Pause</button></span></div><div id="iol_fsw"><div id="iol_fscb"></div><div id="iol_fsc"></div></div></div><div id="iol_sp"><div id="iol_rs">
<div id="iol_rst">ALSO CONSIDER</div><span id="iol_rsp"><div><div class="iol_rsc"><a href="/images/search?q=Doggy+GIF+Style+1+2+3&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Doggy GIF Style 1 2 3" h="ID=images,5187.2">
<img src="http://ts4.mm.bing.net/th?q=Doggy+GIF+Style+1+2+3&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq">Doggy<br/><strong>GIF Style 1 2 3</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Puppies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Puppies" h="ID=images,5189.2"><img src="http://ts1.mm.bing.net/th?q=Puppies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Puppies</strong></span></a></div><div class="iol_rsc"><a href="/images/search?q=Funny+Doggies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Funny Doggies" h="ID=images,5191.2">
<img src="http://ts4.mm.bing.net/th?q=Funny+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Funny</strong><br/>Doggies</span></a></div><div class="iol_rsc"><a href="/images/search?q=Doggie+Dentures&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Doggie Dentures" h="ID=images,5193.2">
<img src="http://ts1.mm.bing.net/th?q=Doggie+Dentures&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Doggie Dentures</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Cute+Doggies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Cute Doggies" h="ID=images,5195.2"><img src="http://ts3.mm.bing.net/th?q=Cute+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Cute</strong><br/>Doggies

Any help is much appreciated!
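
A note on the script hypothesis raised in the question: HTML parsers generally treat the body of a script element as raw text, so markup that only appears inside a JavaScript string is not turned into tags. The sketch below illustrates this on a small hand-written snippet (not Bing's real page) using the built-in html.parser backend. The helper name img_srcs_in_scripts and the sample markup are made up for illustration, and whether this is what happens on the live page depends on the page and the parser in use.

```python
import re
from bs4 import BeautifulSoup

# Hand-written sample (an assumption, not Bing's actual markup): one real
# <img> tag plus one that only exists inside a JavaScript string.
SAMPLE_HTML = """
<html><body>
<img src="http://ts1.mm.bing.net/th?q=real">
<script type="text/javascript">
var t = '<div><img src="http://ts4.mm.bing.net/th?q=in-script"/></div>';
</script>
</body></html>
"""

# Matches the value of a double-quoted src attribute.
SRC_RE = re.compile(r'src="([^"]+)"')


def img_srcs_in_scripts(soup):
    """Return src values found in the raw text of <script> elements."""
    found = []
    for script in soup.find_all("script"):
        # .string is the script body as text (None when there is no single text child).
        text = script.string or ""
        found.extend(SRC_RE.findall(text))
    return found


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only <img> elements that the parser actually built as tags show up here.
parsed_srcs = [img["src"] for img in soup.find_all("img")]
# Anything embedded in script text has to be pulled out of the text itself.
script_srcs = img_srcs_in_scripts(soup)

print(parsed_srcs)
print(script_srcs)
```

If the second list is non-empty while find_all("img") misses those URLs, the links live in script text and need to be extracted from the text rather than from the tag tree.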

3 Answers

0
from bs4 import BeautifulSoup
import requests
import re

def get_soup(url):
    request = requests.get(url).content
    return BeautifulSoup(request)

query = 'doggy'
url = "http://www.bing.com/images/search?q=" + query + "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3"
soup = get_soup(url)
bimg = re.compile('.*mm.bing.net.*')
img_links = soup.find_all("img", {'src': bimg})
for link in img_links:
    print(link)

Output:

<img src="http://ts3.mm.bing.net/th?q=Rabbit&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cow&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Tiger&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Elephant&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Fish&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Fox&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Animal&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Chicken+Bird&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Domestic+Sheep&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Giraffe&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Puppy&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Dolphin&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Pet&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Baby+Birds&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Labrador+Retriever&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Chihuahua&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cat&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Lion&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Zebra&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Bulldog&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>

I adjusted your regular expression slightly.
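
For context on the regex change: the BeautifulSoup documentation says a compiled pattern passed as a filter is applied with its search() method, so the leading and trailing .* should not change which src values match. The sketch below checks this with the re module alone. The two sample URLs and the helper matches are illustrative assumptions, and the escaped pattern is only a suggested tightening, not something taken from this answer.

```python
import re

# Patterns from the question and from this answer, plus a stricter variant
# (the escaped one is an illustrative suggestion, not from the thread).
original = re.compile("mm.bing.net")
tweaked = re.compile(".*mm.bing.net.*")
escaped = re.compile(r"mm\.bing\.net")

# Sample src values (assumptions for illustration).
thumb = "http://ts1.mm.bing.net/th?q=Puppies&w=50&h=50"
lookalike = "http://example.com/mmxbingynet.png"


def matches(pattern, text):
    """Mimic a search()-style filter: True if the pattern occurs anywhere."""
    return pattern.search(text) is not None


for name, pattern in [("original", original),
                      ("tweaked", tweaked),
                      ("escaped", escaped)]:
    print(name, matches(pattern, thumb), matches(pattern, lookalike))
```

The original and tweaked patterns accept the same strings under search(), including the lookalike, because an unescaped dot matches any character. Only the escaped pattern rejects the lookalike.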

0
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.bing.com/images/search?q=%s&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3" % 'doggy'

html_page = urlopen(url)
soup = BeautifulSoup(html_page)

# Collect the src of every <img> tag, keeping only values that contain "http://".
# Building a new list avoids deleting items from a list while looping over it.
img_links = []
for link in soup.find_all("img"):
    src = link.get('src')
    if src and "http://" in src:
        img_links.append(src)

Try this.

The links should end up in the list img_links.

1

@alecxe is right: this is an HTML5 issue. I installed the html5lib library, and the following code solved the problem:

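from bs4 import BeautifulSoup
import requests
import html5lib  # not used directly; importing it confirms the parser is installed

def get_soup(url):
    # Name the parser explicitly so the result does not depend on what is installed.
    return BeautifulSoup(requests.get(url).text, 'html5lib')

query = 'doggy'
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)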

Thanks for your help.
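
To check whether the parser really is the variable, one option is to feed the same markup to each available parser and compare what find_all returns. The following is a minimal sketch under that assumption. The function name count_imgs_by_parser and the sample markup are hypothetical, and parsers that are not installed are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Small sample markup (an assumption for illustration, not Bing's page).
SAMPLE_HTML = (
    '<div><img src="http://ts1.mm.bing.net/a">'
    '<img src="http://ts2.mm.bing.net/b"></div>'
)


def count_imgs_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of <img> tags it yields.

    Parsers that are not installed map to None.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("img"))
    return counts


counts = count_imgs_by_parser(SAMPLE_HTML)
for name, count in counts.items():
    print(name, "not installed" if count is None else count)
```

On well-formed markup like this sample the installed parsers should agree. Running the same function on the text returned by requests.get(url).text would show whether they diverge on the real page.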
