Python RSS web scraping: selecting the correct elements

import urllib2
from urllib2 import urlopen
import re
import cookielib
from cookielib import CookieJar
import time

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        page = 'http://feeds.link.co.uk/thelink/rss.xml'
        sourceCode = opener.open(page).read()
        try:
            titles = re.findall(r'<title>(.*?)</title>', sourceCode)
            desc = re.findall(r'<description>(.*?)</description>', sourceCode)
            links = re.findall(r'<link>(.*?)</link>', sourceCode)
            pub = re.findall(r'<pubDate>(.*?)</pubDate>', sourceCode)
            for i in range(len(titles)):
                print titles[i]
                print desc[i]
                print links[i]
                print pub[i]
                print ""
        except Exception, e:
            print str(e)
    except Exception, e:
        print str(e)

main()

3 answers

User

#1 · Edited 2024-04-20 11:42:01

You should use a proper XML parser, such as Beautiful Soup, rather than regex.

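from __future__ import print_function  # makes print() a function on Python 2 as well
from bs4 import BeautifulSoup

data = sourceCode  # the sourceCode variable from your main() function

# Use the "xml" parser (requires lxml): an HTML parser treats <link> as an
# empty tag and loses its text, and it lowercases tag names such as pubDate.
soup = BeautifulSoup(data, 'xml')
for item in soup.find_all('item'):
    for tag in ['title', 'description', 'link', 'pubDate']:
        print(tag.upper(), item.find(tag).text)
    print()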

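The same per-item approach can also be sketched with the standard library's `xml.etree.ElementTree`, which is a different parser from the Beautiful Soup one suggested above and needs no extra install. This is a minimal Python 3 sketch, not the answerer's code: the `SAMPLE_FEED` string, the `FIELDS` tuple, and the `parse_items` helper are all made up for illustration and stand in for the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the real RSS document.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <link>http://example.com/</link>
    <description>Channel description</description>
    <item>
      <title>First story</title>
      <description>Summary of the first story</description>
      <link>http://example.com/1</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "description", "link", "pubDate")


def parse_items(xml_text):
    """Return one dict per <item>, with "" for any missing field."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext() only looks at direct children of this <item>,
        # so the channel-level <title> and <link> are never mixed in.
        items.append({f: item.findtext(f, default="") for f in FIELDS})
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        for field in FIELDS:
            print(field.upper(), entry[field])
        print()
```

Because every field is looked up inside its own `<item>`, an item with a missing `<description>` or `<pubDate>` yields an empty string instead of shifting the values of later items.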

User

#2 · Edited 2024-04-20 11:42:01

Well, what can I say?

BeautifulSoup could have saved me a lot of typing :)

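from __future__ import print_function  # makes print() a function on Python 2 as well
import urllib2
from bs4 import BeautifulSoup

url = "http://feeds.link.co.uk/thelink/rss.xml"
sourceCode = urllib2.urlopen(url).read()

data = sourceCode

# Use the "xml" parser (requires lxml): an HTML parser treats <link> as an
# empty tag and loses its text, and it lowercases tag names such as pubDate.
soup = BeautifulSoup(data, 'xml')
for item in soup.find_all('item'):
    for tag in ['title', 'description', 'link', 'pubDate']:
        print(tag.upper(), item.find(tag).text)
    print()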

User

#3 · Edited 2024-04-20 11:42:01

Have you tried BeautifulSoup 4? It makes finding the elements you want much easier.

Use code like this:

title = soup.find('title')
if title:
    print title.text

Also, to avoid an "index out of range" error, first check that each list has enough elements before indexing into it.
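A minimal sketch of such a length check, assuming Python 3 and lists shaped like the `re.findall()` results in the question. The `safe_rows` helper and the sample values are hypothetical and only illustrate the idea of stopping at the shortest list.

```python
def safe_rows(*columns):
    """Yield tuples of same-index values, stopping at the shortest list."""
    limit = min(len(col) for col in columns) if columns else 0
    for i in range(limit):
        # i < limit <= len(col) for every column, so no IndexError here.
        yield tuple(col[i] for col in columns)


# Hypothetical results: the channel itself has a <title> and a <link>,
# so those lists are longer than the list of <pubDate> values.
titles = ["Channel", "First story", "Second story"]
links = ["http://example.com/", "http://example.com/1", "http://example.com/2"]
pub = ["Mon, 01 Jan 2024 10:00:00 GMT"]

for title, link, date in safe_rows(titles, links, pub):
    print(title, link, date)
```

This prevents the error but does not realign the lists: in the sample data the channel title ends up paired with the first item's date, which is why reading the fields item by item is the more robust fix.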

I hope this helps :)
