Python, BeautifulSoup or LXML - Parsing image URLs from HTML using CSS tags

Asked 2025-04-16 07:30

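I have looked everywhere for a good explanation of how BeautifulSoup or LXML work. Their documentation is well written, but for someone like me who is new to both Python and programming, it is still hard to follow.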

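Anyway, as my first project, I am using Python to parse an RSS feed for post links, which I have already done with Feedparser. My next step is to scrape each post's images. However, I cannot figure out how to use BeautifulSoup or LXML to do what I want! I have spent hours reading the documentation and searching online without finding an answer, so I am asking here. Below is part of what I want to scrape.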

<div class="bpBoth"><a name="photo2"></a><img src="http://inapcache.boston.com/universal/site_graphics/blogs/bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" style="height:1393px;width:990px" /><br/><div onclick="this.style.display='none'" class="noimghide" style="margin-top:-1393px;height:1393px;width:990px"></div><div class="bpCaption"><div class="photoNum"><a href="#photo2">2</a></div>In this photo released by China's Xinhua news agency, spectators watch an apartment building on fire in the downtown area of Shanghai on Monday Nov. 15, 2010. (AP Photo/Xinhua) <a href="#photo2">#</a><div class="cf"></div></div></div>

From my understanding of the documentation, I should be able to pass in the following:

soup.find("a", { "class" : "bpImage" })

to find all instances with that CSS class. However, it returns nothing. I am sure I am overlooking something simple, so I really appreciate your patience.

Thank you very much for your replies!

For the benefit of future searchers, here is my feedparser code:

#!/usr/bin/env python3

# RSS Feed Parser for the Big Picture Blog

# Import applicable libraries

import feedparser

# Import feed for parsing
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index")

# Print feed name
print(d['feed']['title'])

# Determine number of posts and set range maximum
posts = len(d['entries'])

# Collect post URLs
pointer = 0
while pointer < posts:
    e = d.entries[pointer]
    print(e.link)
    pointer = pointer + 1

3 Answers

Searching for tags with pyparsing is fairly straightforward:

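from pyparsing import makeHTMLTags, withAttribute

# makeHTMLTags returns (opening tag, closing tag) expressions; only the opening tag is needed
imgTag, notused = makeHTMLTags('img')

# only retrieve <img> tags with class='bpImage'
imgTag.setParseAction(withAttribute(**{'class':'bpImage'}))

# html is assumed to hold the page source as a string
for img in imgTag.searchString(html):
    print(img.src)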

Answered 2025-04-16 by Python大师

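The code you posted looks for <a> elements with the bpImage class. In your example, though, the bpImage class is on the <img> element, not on an <a> element. You just need to do this: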

soup.find("img", { "class" : "bpImage" })
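
As a minimal sketch of that fix, assuming the current bs4 package (BeautifulSoup 4) rather than whichever version the question used, the snippet below parses a shortened copy of the sample markup and collects the src of every img carrying the bpImage class. The html string and the image_urls helper are illustrative names, not part of the original post. Since find returns only the first match, find_all is used here to get every image.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative input).
html = (
    '<div class="bpBoth"><a name="photo2"></a>'
    '<img src="http://inapcache.boston.com/universal/site_graphics/blogs/'
    'bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" '
    'style="height:1393px;width:990px" /><br/>'
    '<div class="bpCaption"><a href="#photo2">#</a></div></div>'
)


def image_urls(markup, css_class="bpImage"):
    """Return the src of every <img> tag that has the given CSS class."""
    # "html.parser" is the parser bundled with Python, so no extra backend is needed.
    soup = BeautifulSoup(markup, "html.parser")
    return [
        img["src"]
        for img in soup.find_all("img", class_=css_class)
        if img.has_attr("src")  # skip images without a src attribute
    ]


urls = image_urls(html)
print(urls)
```

Searching for "a" instead of "img" with the same class would give an empty list on this markup, which is why the original call found nothing.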

Answered 2025-04-16 by Python大师

With the lxml library, you can do this:

import feedparser
import lxml.html as lh
from urllib.request import urlopen

# Import feed for parsing
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index")

# Print feed name
print(d['feed']['title'])

# Determine number of posts
posts = len(d['entries'])

# Visit each post and print the URL of every <img class="bpImage">
for post in d['entries']:
    link = post['link']
    print('Parsing {0}'.format(link))
    doc = lh.parse(urlopen(link))
    imgs = doc.xpath('//img[@class="bpImage"]')
    for img in imgs:
        print(img.attrib['src'])
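
One caveat with the XPath above: @class="bpImage" is an exact string comparison, so it would miss an element written as class="bpImage large". The sketch below runs against made-up markup rather than the live page and shows a common XPath idiom that matches bpImage as one of several space-separated classes. The SAMPLE string, the CLASS_XPATH constant, and the bp_image_urls function are hypothetical names introduced for illustration.

```python
import lxml.html as lh

# Made-up markup (not from the real page) with single- and multi-class images.
SAMPLE = (
    '<div>'
    '<img src="http://example.com/one.jpg" class="bpImage">'
    '<img src="http://example.com/two.jpg" class="bpImage large">'
    '<img src="http://example.com/ad.jpg" class="banner">'
    '</div>'
)

# Pad the class attribute with spaces so " bpImage " matches as a whole word
# regardless of where it sits in the class list.
CLASS_XPATH = (
    '//img[contains(concat(" ", normalize-space(@class), " "), " bpImage ")]/@src'
)


def bp_image_urls(markup):
    """Return the src of every <img> whose class list includes bpImage."""
    doc = lh.fromstring(markup)
    # xpath() returns string-like results for attribute selections; convert to plain str.
    return [str(src) for src in doc.xpath(CLASS_XPATH)]


urls = bp_image_urls(SAMPLE)
print(urls)
```

For the page in the question the exact comparison is enough, because the sample img tag has bpImage as its only class.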

Answered 2025-04-16 by Python大师
