Scraping an RSS feed with Python

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup
    import re

    source = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()

    title = re.compile('<title>(.*)</title>')
    link = re.compile('<link>(.*)</link>')

    find_title = re.findall(title, source)
    find_link = re.findall(link, source)

    literate = []
    literate[:] = range(1, 16)

    for i in literate:
        print find_title[i]
        print find_link[i]

2 Answers

User
#1 · edited 2024-05-12 23:07:18

I think you are using the wrong regex to extract the links from the page.
    >>> link = re.compile('<link rel="alternate" type="text/html" href=(.*)')
    >>> find_link = re.findall(link, source)
    >>> find_link[1].strip()
    '"http://www.huffingtonpost.com/andrew-brandt/the-peyton-predicament-pa_b_1271834.html" />'
    >>> len(find_link)
    15
    >>>
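If you look at the page's HTML source, you will see that the links are not wrapped in a <link></link> pair.
The actual pattern is <link rel="alternate" type="text/html" href= followed by the link itself.
That is why your regex finds nothing.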
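To make the difference concrete, here is a minimal sketch that runs both patterns against a small inline sample. The sample markup and the example.com URLs are made up for illustration, so the real feed may differ in detail. The second pattern is a slightly tightened variant of the one above: it captures only the text between the quotes, so the trailing `" />` is not included in the result.

```python
import re

# Hypothetical Atom-style fragment; the real feed's markup may differ.
SAMPLE = '''<entry>
  <title>First story</title>
  <link rel="alternate" type="text/html" href="http://example.com/first.html" />
</entry>
<entry>
  <title>Second story</title>
  <link rel="alternate" type="text/html" href="http://example.com/second.html" />
</entry>'''

# The question's pattern expects <link>...</link>, which never occurs here.
old_links = re.findall(r'<link>(.*)</link>', SAMPLE)

# Capture only what sits between the quotes of the href attribute.
link_re = re.compile(r'<link rel="alternate" type="text/html" href="([^"]*)"')
new_links = link_re.findall(SAMPLE)

print(old_links)
print(new_links)
```

This still relies on the attributes appearing in exactly this order, which is one reason a regex is fragile for this job.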

User
#2 · edited 2024-05-12 23:07:18

You can use the feedparser module to parse an RSS feed from a given URL:
    #!/usr/bin/env python
    import feedparser  # pip install feedparser

    d = feedparser.parse('http://feeds.huffingtonpost.com/huffingtonpost/latestnews')
    # .. skipped handling http errors, caching ..

    for e in d.entries:
        print(e.title)
        print(e.link)
        print(e.description)
        print("\n")  # 2 newlines
Output
    Even Critics Of Safety Net Increasingly Depend On It
    http://www.huffingtonpost.com/2012/02/12/safety-net-benefits_n_1271867.html
    <p>Ki Gulbranson owns a logo apparel shop, deals in

    Christopher Cain, Atlanta Anti-Gay Attack Suspect, Arrested And Charged With Aggravated Assault And Robbery
    http://www.huffingtonpost.com/2012/02/12/atlanta-anti-gay-suspect-christopher-cain-arrested_n_1271811.html
    <p>ATLANTA -- Atlanta police have arrested a suspect
Using regular expressions to parse RSS (XML) is probably not a good idea.
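If installing feedparser is not an option, the standard library's XML parser is another way to avoid regexes. The following is only a sketch under assumptions: the inline sample feed, the example.com URLs, and the helper name `extract_entries` are all hypothetical, and it assumes an Atom-style feed with `<entry>` elements. A plain RSS 2.0 feed uses `<item>` elements and no namespace, so the paths would need adjusting.

```python
import xml.etree.ElementTree as ET

# Hypothetical Atom feed used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>First story</title>
    <link rel="alternate" type="text/html" href="http://example.com/first.html" />
  </entry>
  <entry>
    <title>Second story</title>
    <link rel="alternate" type="text/html" href="http://example.com/second.html" />
  </entry>
</feed>"""

# Prefix-to-URI mapping so the search paths below stay readable.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def extract_entries(xml_text):
    """Return a list of (title, href) tuples, one per Atom entry."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        # Pick the link marked rel="alternate", if the entry has one.
        link = entry.find("atom:link[@rel='alternate']", ATOM)
        href = link.get("href") if link is not None else None
        results.append((title, href))
    return results


for title, href in extract_entries(SAMPLE_FEED):
    print(title)
    print(href)
```

Unlike the regex, this does not depend on attribute order or on how the markup is split across lines.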
