Scraping information from a website with lxml
I am trying to use lxml to scrape a list of all the titles from Reddit.com. I used this query:
reddit = etree.HTML( urllib.urlopen("http://www.reddit.com/r/all/top").read() )
reddit.xpath("//div[contains(@class,'title')]//b/text()")
However, when I run this expression at the Python command line, nothing is displayed. Is the XPath wrong?
I am using Python 2.7.
Here is the complete code:
import urllib
import os, random, sys, math
from lxml import etree

def main():
    reddit = etree.HTML( urllib.urlopen("http://www.reddit.com/r/all/top").read() )
    reddit.xpath("//div[contains(@class,'title')]//b/text()")

if __name__ == "__main__":
    main()
2 Answers
2
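You are not connected to the internet. Try again.
Or
Your Python installation may be broken, or you have mixed up two error messages... note how the path suddenly changes from 3.1 to 2.7!
Update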
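Nothing shows up at the command line because you are not printing anything.
At the very least, if you replace reddit.xpath("blahblah")
with:
result = reddit.xpath("blahblah")
print result
you will see that the current "blahblah" returns [], and you can then tell right away whether adjusting "blahblah" improves things.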
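The print-and-adjust loop described above can be sketched offline. This is only an illustration: SAMPLE_HTML is made-up markup, not Reddit's actual page structure, and run_query is a hypothetical helper name. The second XPath is chosen to match the sample, so it is an assumption rather than a known-good query for the live site.

```python
from lxml import etree

# Made-up markup for illustration only; the real Reddit page may differ.
SAMPLE_HTML = """
<html><body>
  <div class="entry">
    <p class="title"><a class="title" href="/a">First headline</a></p>
  </div>
  <div class="entry">
    <p class="title"><a class="title" href="/b">Second headline</a></p>
  </div>
</body></html>
"""


def run_query(tree, expression):
    """Evaluate an XPath expression, print the result, and return it."""
    result = tree.xpath(expression)
    print("%-50s -> %r" % (expression, result))
    return result


tree = etree.HTML(SAMPLE_HTML)

# The query from the question: no <div> in the sample has 'title' in its
# class attribute, so nothing matches and the result is an empty list.
original = run_query(tree, "//div[contains(@class,'title')]//b/text()")

# An adjusted query written against the sample markup above.
adjusted = run_query(tree, "//p[contains(@class,'title')]/a/text()")
```

Printing each attempt makes an empty result visible immediately, so every change to the expression can be judged against the markup actually being parsed.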
6
Reddit has an API, so you do not need to scrape the page. Just append '.json' to the end of the URL:
#!/usr/bin/env python
import json
import urllib2

url = "http://www.reddit.com/r/all/top/.json"
data = json.load(urllib2.urlopen(url))
for child in data['data']['children']:
    print child['data']['title']
Sample output
Dear America, I Saw You Naked: And yes, we were laughing. Confessions of an ex-TSA agent
My wife and I are expecting our son in June, so I installed a fiber-optic star ceiling :)
You wouldn't download a car: Honda releases concept car 3D printing files
So my liquor store I managed closed today, the VP came in to collect the liquor but told me "we're not going to resell the beer, we'll be here about an hour fill up your car."
Baby Olinguito (Recently Discovered Species!)
Bower Bird- in a desperate bid for attention from the opposite sex, Bower males build nests, then decorate with objects of a single color. (xpost- /r/everythingscience)
My friend works as a English teacher in Sweden.
My kid's homework, I think the page designer has had enough.
Man Washes up in Marshall Islands 'After 16 Months Adrift' at sea
Kitten plays the air harp
New roommate already started off on a bad note with us.
MRW a program crashes and asks to contact tech support... and I am tech support.
Jack Black just posted this to facebook. "This is fan art. But it's exactly how I remember it."
Looks like Colorado's legalization has caused problems after all. [4]
My new kitten likes to "hold hands." She does this for as long as you offer your finger.
Ahahaha he got you go-wahhhh
Shipwrecked man makes land 'after 16 months adrift'
As someone who's taken math at university
Footage released of Guardian editors destroying Snowden hard drives: GCHQ technicians watched as journalists took angle grinders and drills to computers after weeks of tense negotiations
TIL Mike Tyson offered a zoo attendant $10,000 to open the cage of a bullying gorilla so he could "smash that silverback's snotbox." His offer was declined.
Microsoft being helpful as always
President Barack Obama says in a new interview that he would support efforts to remove marijuana from the federal government’s list of the most serious narcotics, but that Congress must act to make the change.
advisory Vila Franca's Islet, Azores Archipelago, Portugal [1440x900] - How can that be so spherical?
The dad on my Child Development book is putting the kids helmet on backwards.
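For readers on Python 3, where urllib2 no longer exists and its functions live in urllib.request, the same idea might look like the sketch below. The names extract_titles and fetch_titles and the User-Agent string are hypothetical. Sending a custom User-Agent is an assumption about what the server prefers, not something stated in the answer. SAMPLE is a hand-written stand-in that only mimics the data/children/data/title nesting used in the answer's loop, so the parsing step can be tried without a network connection.

```python
import json
from urllib.request import Request, urlopen


def extract_titles(listing):
    """Return the title of every post in a decoded listing."""
    return [child["data"]["title"] for child in listing["data"]["children"]]


def fetch_titles(url, user_agent="title-lister-example/0.1"):
    """Download a JSON listing and return its titles (hypothetical helper)."""
    # Assumption: the server may treat requests with a descriptive
    # User-Agent more kindly than the library default.
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return extract_titles(json.loads(response.read().decode("utf-8")))


# Hand-written stand-in for a real response; only the nesting is modeled.
SAMPLE = json.loads("""
{"data": {"children": [
    {"data": {"title": "Kitten plays the air harp"}},
    {"data": {"title": "Microsoft being helpful as always"}}
]}}
""")

titles = extract_titles(SAMPLE)
for title in titles:
    print(title)
```

Assuming the endpoint still responds as it did in the answer, calling fetch_titles("http://www.reddit.com/r/all/top/.json") should return the live titles. Keeping the parsing in its own function means it can be checked against a saved response before any request is made.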