Extracting the "src" attribute from an "img" tag with BeautifulSoup

Posted 2024-04-20 13:06:35


<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>

I'm using bs4 and I can't get the src with a.attrs['src'], although I can get the href that way. What should I do?


2 answers

The link (the a tag) has no src attribute; you have to target the actual img tag.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# this returns the src attribute of the img tag nested inside the 'a' tag
soup.a.img['src']

>>> 'some'

# if you have more than one 'a' tag, skip those without a nested img
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])

>>> some
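
To see why a.attrs['src'] fails in the question, it can help to look at each tag's own attribute dictionary. The following is a minimal sketch that reuses the HTML from the question; the CSS selector and the variable names are illustrative choices, not part of the original answer.

```python
import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Each tag carries only its own attributes, not those of its children.
a_attrs = soup.a.attrs        # attributes of <a>: only href
img_attrs = soup.a.img.attrs  # attributes of <img>: alt and src

# .get() returns None for a missing attribute instead of raising KeyError.
missing_src = soup.a.get("src")

# A CSS selector is another way to reach the nested img directly.
img = soup.select_one("div.someClass a img")
src = img["src"] if img is not None else None

print(a_attrs, img_attrs, missing_src, src)
```

Using .get() rather than indexing is a reasonable default when some tags may lack the attribute.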

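You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach also works for a URL fetched with urllib2 (urllib.request in Python 3).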

For a URL

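from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup as BSHTML  # bs4 replaces the old BeautifulSoup 3 package

page = urlopen('http://www.youtube.com/')
soup = BSHTML(page, 'html.parser')
images = soup.find_all('img')
for image in images:
    # print the image source; .get() returns None instead of raising KeyError
    print(image.get('src'))
    # print the alternate text, which many img tags omit
    print(image.get('alt'))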
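
The src values found on a real page are often relative paths, so they usually need to be joined with the page URL before they can be downloaded. Below is a rough sketch of that step; the base URL and the markup are hypothetical and stand in for a live request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, used here instead of a live request.
base_url = "https://example.com/gallery/"
html = '<img src="/static/a.png" alt="A"><img src="b.png" alt="B">'

soup = BeautifulSoup(html, "html.parser")

# Resolve every src against the URL of the page it came from.
absolute_srcs = [urljoin(base_url, image["src"]) for image in soup.find_all("img")]

print(absolute_srcs)
```

urljoin leaves an already absolute src unchanged, so the same loop can handle a mix of relative and absolute values.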

For text containing img tags

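from bs4 import BeautifulSoup as BSHTML  # bs4 replaces the old BeautifulSoup 3 package

# each img tag must be closed, otherwise the parser may merge the two tags
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText, 'html.parser')
images = soup.find_all('img')
for image in images:
    print(image['src'])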
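
Indexing with image['src'] raises KeyError for any img tag that has no src. One way around this, sketched here with made-up markup, is to let find_all keep only the tags that actually carry the attribute by passing src=True.

```python
from bs4 import BeautifulSoup

# Made-up markup: the middle img tag has no src attribute.
html_text = (
    '<img src="https://src1.com/" /> '
    '<img alt="no source" /> '
    '<img src="https://src2.com/" />'
)

soup = BeautifulSoup(html_text, "html.parser")

# src=True matches only tags that have a src attribute, whatever its value.
sources = [image["src"] for image in soup.find_all("img", src=True)]

print(sources)
```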
