Unicode equal comparison fails when scraping a Twitter page with Python
I am using the following code to get a list of a user's followers on Twitter:
import urllib
from BeautifulSoup import BeautifulSoup

#code only looks at one page of followers instead of continuing to all of a user's followers
#decided to only use a small sample
site = "http://mobile.twitter.com/NYTimesKrugman/following"
friends = set()
response = urllib.urlopen(site)
html = response.read()
soup = BeautifulSoup(html)
names = soup.findAll('a', {'href': True})
for name in names:
    a = name.renderContents()
    b = a.lower()
    if ("http://mobile.twitter.com/" + b) == name['href']:
        c = str (b)
        friends.add(c)
for friend in friends:
    print friend
print ("Done!")
However, I get the following output:
NYTimeskrugman
nytimesphoto
rasermus
Warning (from warnings module):
File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion14.py", line 42
if ("http://mobile.twitter.com/" + b) == name['href']:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
amnesty_norge
zynne_
fredssenteret
oljestudentene
solistkoret
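....(and so on)
It looks like I am able to get most of the followers' names, but I run into an error that seems somewhat random. The error does not stop the code from running, though... I was hoping someone could tell me what is happening?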
1 Answer
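I don't know whether my answer will still be useful years later, but I rewrote your code to use the requests library instead of urllib.
I think it is better to also select the elements with the class "username", so that only the followers' names are considered!
Here is the modified code: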
import requests
from bs4 import BeautifulSoup

site = "http://mobile.twitter.com/paulkrugman/followers"
friends = set()
response = requests.get(site)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, "html.parser")
names = soup.findAll('a', {'href': True})
for name in names:
    # Only keep links that contain a <span class="username"> element
    pseudo = name.find("span", {"class": "username"})
    if pseudo:
        pseudo = pseudo.get_text()
        friends.add(pseudo)
for friend in friends:
    print (friend)
print("Done !")
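As for the warning in the question, here is a likely explanation, assuming BeautifulSoup 3 on Python 2 as the imports suggest: renderContents() returns an encoded byte string by default, while name['href'] is a unicode string. To compare the two, Python 2 implicitly decodes the byte string as ASCII. Whenever a link's text contains non-ASCII characters, that decode fails, Python emits the UnicodeWarning, and the two values are treated as unequal. That would also explain why it looks random: it only happens for links with such characters. Decoding explicitly before comparing avoids it. The sketch below shows the idea in Python 3 syntax; link_text, href, and matches_profile are hypothetical stand-ins I made up, and UTF-8 is an assumed encoding.

```python
# Hypothetical sample values standing in for scraped data (not taken from the real page).
BASE = "http://mobile.twitter.com/"
# Bytes, similar to what an encoded renderContents() result would be ("\u00eb" is e with diaeresis).
link_text = "Zo\u00eb_Example".encode("utf-8")
# Text, similar to name['href'].
href = "http://mobile.twitter.com/zo\u00eb_example"


def matches_profile(raw_text, href, encoding="utf-8"):
    """Decode the link text explicitly, then compare text with text."""
    text = raw_text.decode(encoding).lower()
    return BASE + text == href


print(matches_profile(link_text, href))
```

In the Python 2 code from the question, the same idea would be to call a.decode("utf-8") before lowercasing and comparing, so that both sides of == are unicode strings.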
Note that @paulkrugman shows up in every set of results, so don't forget to remove it!
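Removing the account's own handle could look like the minimal sketch below. The sample set is made up for illustration, and the exact text of the handle (with or without the leading @) is an assumption to check against what the scraper actually returns.

```python
# Made-up sample of scraped handles; the real set comes from the scraper.
friends = {"@paulkrugman", "@example_follower_1", "@example_follower_2"}

# discard() removes the entry if it is present and does nothing otherwise,
# so it is safe even when the handle is missing from a page.
friends.discard("@paulkrugman")

for friend in sorted(friends):
    print(friend)
```

Using discard() rather than remove() means no KeyError is raised if the handle happens not to be in the set.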