Iterating over nodes with lxml in Python
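I have a web page that I am parsing with BeautifulSoup, but it is a bit slow, so I decided to try lxml because I have heard it is very fast.
However, I am struggling to get my code to iterate over just the part I want. I am not sure how to use lxml, and I cannot find clear documentation.
Here is my code: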
import urllib, urllib2
from lxml import etree

def wgetUrl(target):
    # Fetch the page body; return an empty string on any failure
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt

newUrl = 'http://www.tv3.ie/3player'
data = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree = etree.fromstring(data, parser)

# Print every <div> in the document
for elem in tree.iter("div"):
    print elem.tag, elem.attrib, elem.text
This returns every DIV, but how do I iterate over only the one whose id is 'slider1'?
div {'style': 'position: relative;', 'id': 'slider1'} None
This does not work:
for elem in tree.iter("slider1"):
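I know this is probably a silly question, but I just cannot figure it out...
Thanks!
** Edit **
With your help, after adding the code below, I now get the following output: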
for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
    print elem[0].tag, elem[0].attrib, elem[0].text
    print elem[1].tag, elem[1].attrib, elem[1].text
    print elem[2].tag, elem[2].attrib, elem[2].text
    print elem[3].tag, elem[3].attrib, elem[3].text
    print elem[4].tag, elem[4].attrib, elem[4].text
Output:
a {'href': '/3player/show/392/57922/1/Tallafornia', 'title': '3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension'} None
h3 {} None
span {'id': 'gridcaption'} The Tallafornia crew are back, living in a beachside vill...
span {'id': 'griddate'} 11/01/2013
span {'id': 'gridduration'} 00:27:52
This is all fine, but I am missing part of the a tag above. Is the parser not handling the markup correctly?
I am not getting the following:
<img alt="3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension" src='http://content.tv3.ie/content/videos/0378/tallaforniaep2_fri11jan2013_3player_1_57922_180x102.jpg' class='shadow smallroundcorner'></img>
Any idea why it is not picking this up?
Thanks again, this has been a very helpful post...
2 Answers
0
This is how I got it working myself. I am not sure it is the best approach, so comments are welcome:
import urllib2, re
from lxml import etree
from datetime import datetime

def wgetUrl(target):
    # Fetch the page body; return an empty string on any failure
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt

start = datetime.now()
newUrl = 'http://www.tv3.ie/3player' # homepage
data = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree = etree.fromstring(data, parser)

# The union (|) selects both the grid <div>s and the <img>s nested inside them
for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow'] | //div[@id='slider1']//div[@id='gridshow']//img[@class='shadow smallroundcorner']"):
    if elem.tag == 'img':
        img = elem.attrib.get('src')
        print 'img: ', img
    if elem.tag == 'div':
        show = elem[0].attrib.get('href')
        print 'show: ', show
        titleData = elem[0].attrib.get('title')
        # Title format: "3player | <title>, <dd/mm/yyyy>. <description>"
        match=re.search(r"3player\s+\|\s+(.+),\s+(\d\d/\d\d/\d\d\d\d)\.\s*(.*)", titleData)
        title=match.group(1)
        print 'title: ', title
        description = match.group(3)
        print 'description: ', description
        date = elem[3].text
        duration = elem[4].text
        print 'date: ', date
        print 'duration: ', duration

end = datetime.now()
print 'time took was ', (end-start)
The run time is not bad, although it is not as much faster than BeautifulSoup as I expected.
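The code above calls match.group() without checking whether the regular expression matched, so a title attribute in a different format would raise an AttributeError. A minimal sketch of a guarded version follows. It is written for Python 3 (unlike the Python 2 code above), and the helper name parse_title is hypothetical; the pattern is the same one used in the answer, and the sample string is taken from the output shown in the question.

```python
import re

# Same pattern as in the answer above: "3player | <title>, <dd/mm/yyyy>. <description>"
TITLE_RE = re.compile(r"3player\s+\|\s+(.+),\s+(\d\d/\d\d/\d\d\d\d)\.\s*(.*)")


def parse_title(title_data):
    """Hypothetical helper: return (title, date, description), or None if the format differs."""
    if not title_data:
        return None
    match = TITLE_RE.search(title_data)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


sample = ("3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, "
          "living in a beachside villa in Santa Ponsa, Majorca.")
parsed = parse_title(sample)
print(parsed)

# A string in another format yields None instead of raising
unparsed = parse_title("Something else entirely")
print(unparsed)
```

Returning None lets the caller decide whether to skip the item or fall back to other fields such as the griddate span.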
2
You can use the following XPath expression:
for elem in tree.xpath("//div[@id='slider1']"):
For example:
>>> import urllib2
>>> import lxml.etree
>>> url = 'http://www.tv3.ie/3player'
>>> data = urllib2.urlopen(url)
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(data,parser)
>>> elem = tree.xpath("//div[@id='slider1']")
>>> elem[0].attrib
{'style': 'position: relative;', 'id': 'slider1'}
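To see why tree.iter("slider1") finds nothing while the XPath works: iter() filters on the tag name, whereas the [@id='slider1'] predicate filters on an attribute value. Below is a small offline sketch in Python 3 (the session above is Python 2). SAMPLE_HTML is a made-up stand-in for the real page, not its actual markup.

```python
from lxml import etree

# Hypothetical stand-in for the real page, so the example runs without network access
SAMPLE_HTML = """
<html><body>
  <div id="header">top</div>
  <div id="slider1" style="position: relative;">
    <div id="gridshow">first</div>
    <div id="gridshow">second</div>
  </div>
</body></html>
"""

tree = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())

# iter() matches tag names only; there is no <slider1> element, so this is empty
by_tag = list(tree.iter("slider1"))

# Filtering on the id attribute by hand works, but is verbose
by_hand = [el for el in tree.iter("div") if el.get("id") == "slider1"]

# The XPath predicate does the same filtering in a single expression
by_xpath = tree.xpath("//div[@id='slider1']")

slider = by_xpath[0]
# Iterating an element yields its direct child elements
child_texts = [child.text for child in slider]

print(len(by_tag), len(by_hand), len(by_xpath))
print(slider.tag, dict(slider.attrib))
print(child_texts)
```

xpath() always returns a list, even for a single match, which is why the session above indexes elem[0].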
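You need to analyse the content of the page you are working with more closely (a good way to do that is Firefox with the Firebug add-on).
The <img> tag you are after is actually a child of the <a> tag: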
>>> for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
...     for elem_a in elem.xpath("./a"):
...         for elem_img in elem_a.xpath("./img"):
...             print '<A> HREF=%s'%(elem_a.attrib['href'])
...             print '<IMG> ALT="%s"'%(elem_img.attrib['alt'])
...
<A> HREF=/3player/show/392/58784/1/Tallafornia
<IMG> ALT="3player | Tallafornia, 01/02/2013. A fresh romance blossoms in the Tallafornia house. Marc challenges Cormac to a 'bench off' in the gym"
<A> HREF=/3player/show/46/58765/1/Coronation-Street
<IMG> ALT="3player | Coronation Street, 01/02/2013. Tyrone bumps into Kirsty in the street and tries to take Ruby from her pram"
../..
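This also explains the None printed for the a element in the question: elem[0] is the <a> itself, its .text is None because the link contains no text before its first child, and the <img> sits one level further down. A Python 3 sketch on made-up markup follows; SAMPLE_HTML and its image URL are assumptions modelled loosely on the output above, not the real page.

```python
from lxml import etree

# Hypothetical markup modelled on the output shown above; not the real page source
SAMPLE_HTML = """
<div id="slider1">
  <div id="gridshow">
    <a href="/3player/show/392/57922/1/Tallafornia" title="3player | Tallafornia, 11/01/2013."><img alt="Tallafornia" src="http://example.invalid/tallafornia.jpg" class="shadow smallroundcorner"></a>
    <h3>Tallafornia</h3>
    <span id="griddate">11/01/2013</span>
    <span id="gridduration">00:27:52</span>
  </div>
</div>
"""

tree = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())

shows = []
for item in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
    link = item.find("a")    # first <a> child of the grid item
    img = link.find("img")   # the <img> is nested inside the <a>
    shows.append({
        "href": link.get("href"),
        "link_text": link.text,  # None: nothing precedes the <img> inside the link
        "img_src": img.get("src"),
        "date": item.findtext("span[@id='griddate']"),
        "duration": item.findtext("span[@id='gridduration']"),
    })

for show in shows:
    print(show)
```

Looking children up by tag or id, rather than by position such as elem[3], keeps the code working if the page adds or reorders elements inside a grid item.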