Scraping an RSS feed with Python

3 votes

2 answers

6940 views

Asked 2025-04-17 12:54

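I'm new to Python and to programming in general, so please bear with me if this is a silly question.

I'm following this RSS scraping tutorial step by step, but Python raises a "list index out of range" error when I try to get the links that correspond to the article titles.

Here is my code: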

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

source  = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()

title = re.compile('<title>(.*)</title>')
link = re.compile('<link>(.*)</link>')

find_title = re.findall(title, source)
find_link = re.findall(link, source)

literate = []
literate[:] = range(1, 16)

for i in literate:
    print find_title[i]
    print find_link[i]

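When I only ask for the titles, the code runs fine, but as soon as I try to get both the titles and their corresponding links, I get the index error.

Any help would be much appreciated.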


2 Answers

You can use the feedparser module to parse an RSS feed from a given URL:

#!/usr/bin/env python
import feedparser # pip install feedparser

d = feedparser.parse('http://feeds.huffingtonpost.com/huffingtonpost/latestnews')
# .. skipped handling HTTP errors, caching ..

for e in d.entries:
    print(e.title)
    print(e.link)
    print(e.description)
    print("\n") # 2 newlines

Output

Even Critics Of Safety Net Increasingly Depend On It
http://www.huffingtonpost.com/2012/02/12/safety-net-benefits_n_1271867.html
<p>Ki Gulbranson owns a logo apparel shop, deals in 
<!-- ... snip ... -->

Christopher Cain, Atlanta Anti-Gay Attack Suspect, Arrested And
Charged With Aggravated Assault And Robbery
http://www.huffingtonpost.com/2012/02/12/atlanta-anti-gay-suspect-christopher-cain-arrested_n_1271811.html
<p>ATLANTA -- Atlanta police have arrested a suspect 
<!-- ... snip ... -->

Parsing RSS (XML) with regular expressions is probably not a good idea.
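If installing feedparser is not an option, the standard library's XML parser is one alternative to regular expressions. The following is only a minimal sketch: `SAMPLE_RSS` and `parse_items` are made-up names for illustration, the sample is a hand-written RSS 2.0 document rather than the real Huffington Post feed, and a real feed may use a different layout (Atom, namespaces) that needs different element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 sample, written by hand for this example.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return a list of (title, link) pairs, one per <item> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing,
        # so a malformed item does not raise an exception here.
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        pairs.append((title, link))
    return pairs


for title, link in parse_items(SAMPLE_RSS):
    print(title)
    print(link)
```

Because each title and link is read from the same `<item>`, the two values cannot drift out of step the way two independent `re.findall` lists can.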

Answered 2025-04-17 by Python大师


I think the regular expression you are using is wrong, and that is where extracting the links goes astray.

>>> link = re.compile('<link rel="alternate" type="text/html" href=(.*)')
>>> find_link = re.findall(link, source)
>>> find_link[1].strip()
'"http://www.huffingtonpost.com/andrew-brandt/the-peyton-predicament-pa_b_1271834.html" />'
>>> len(find_link)
15
>>>

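Look at the HTML source of your page and you will see that the links are not wrapped in <link></link>.

The actual format is <link rel="alternate" type="text/html" href= followed by the link.

That is why your regular expression doesn't work.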
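To make the failure concrete, here is a small sketch on a hand-written snippet (`SAMPLE_ATOM` is hypothetical, not the real feed). It assumes the feed uses the `<link ... href="..." />` form described above, and it tightens the pattern to `href="([^"]*)"` so that only the URL is captured, without the trailing `" />`. That tighter pattern is a suggestion, not the pattern from the session above.

```python
import re

# Hypothetical Atom-style snippet, written by hand for this example.
SAMPLE_ATOM = '''<feed>
<title>Example feed</title>
<entry>
<title>First story</title>
<link rel="alternate" type="text/html" href="http://example.com/first" />
</entry>
<entry>
<title>Second story</title>
<link rel="alternate" type="text/html" href="http://example.com/second" />
</entry>
</feed>'''

titles = re.findall(r'<title>(.*)</title>', SAMPLE_ATOM)
# The pattern from the question: nothing in this snippet is wrapped in <link></link>.
old_links = re.findall(r'<link>(.*)</link>', SAMPLE_ATOM)
# Capture only the text between the quotes of the href attribute.
new_links = re.findall(
    r'<link rel="alternate" type="text/html" href="([^"]*)"', SAMPLE_ATOM)

print(len(titles), len(old_links), len(new_links))

try:
    old_links[1]
except IndexError as exc:
    # An empty list has no element 1, hence "list index out of range".
    print("IndexError:", exc)

# titles[0] is the feed's own title here, so skip it before pairing.
# zip stops at the shorter list instead of raising IndexError.
pairs = list(zip(titles[1:], new_links))
for title, link in pairs:
    print(title)
    print(link)
```

Iterating with `zip` rather than a fixed `range(1, 16)` also avoids the error when a feed happens to contain fewer entries than expected.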

Answered 2025-04-17 by Python大师

