Unicode equal comparison fails when scraping a Twitter page with Python

Asked 2025-04-16 23:58

I am using the following code to get a list of a user's followers on Twitter:

import urllib
from BeautifulSoup import BeautifulSoup

#code only looks at one page of followers instead of continuing to all of a user's followers
#decided to only use a small sample 

site = "http://mobile.twitter.com/NYTimesKrugman/following"
friends = set()
response = urllib.urlopen(site)
html = response.read()
soup = BeautifulSoup(html)
names = soup.findAll('a', {'href': True})
for name in names:
    a = name.renderContents()
    b = a.lower()
    if ("http://mobile.twitter.com/" + b) == name['href']:
        c = str (b)
        friends.add(c)

for friend in friends:
    print friend
print ("Done!")

However, I get the following output:

NYTimeskrugman
nytimesphoto
rasermus

Warning (from warnings module):
   File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter     Crawler\crawlerversion14.py", line 42
    if ("http://mobile.twitter.com/" + b) == name['href']:
 UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
amnesty_norge
zynne_
fredssenteret
oljestudentene
solistkoret

....(and it continues)

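It looks like I can get most of the followers' names, but I run into a somewhat random error. It does not stop the code from running, though... Can anyone tell me what is going on?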

1 Answer


I don't know whether my answer is still useful a few years later, but I changed your code to use the requests library instead of urllib.
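For what it's worth, the warning in the question most likely comes from Python 2 comparing a byte string with a unicode string: renderContents() returns encoded bytes while name['href'] is unicode, and when the bytes contain a non-ASCII character the implicit ASCII decode fails, so the two are treated as unequal. With requests, response.text is already decoded text. Here is a minimal sketch of the idea, written for Python 3 with made-up sample data (the handle "blåbær" and the helper matches_href are hypothetical, not taken from the page):

```python
# Hypothetical sample: raw UTF-8 bytes of a link's text, as a page might
# deliver them, and the already-decoded href of the same link.
PREFIX = "http://mobile.twitter.com/"
link_text_bytes = "Blåbær".encode("utf-8")
href = "http://mobile.twitter.com/blåbær"


def matches_href(raw_text, href):
    """Decode the raw link text explicitly, then compare text with text."""
    text = raw_text.decode("utf-8").lower()
    return PREFIX + text == href.lower()


# Comparing the undecoded bytes with a str can never succeed...
print(link_text_bytes == "Blåbær")
# ...whereas decoding first gives a meaningful comparison.
print(matches_href(link_text_bytes, href))
```

The point is only that both sides of the comparison should be decoded text before they are compared.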

I think it is also best to select the elements with the class "username", so that only followers' names are considered!

Here is the modified code:

import requests
from bs4 import BeautifulSoup

site = "http://mobile.twitter.com/paulkrugman/followers"
friends = set()
response = requests.get(site)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, "html.parser")
names = soup.find_all('a', {'href': True})
for name in names:
    # Only links that wrap a <span class="username"> are follower names
    pseudo = name.find("span", {"class": "username"})
    if pseudo:
        friends.add(pseudo.get_text())

for friend in friends:
    print(friend)
print("Done !")

Note that @paulkrugman shows up in every set of results, so don't forget to remove it!
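One way to drop it, as a small sketch: this assumes the extracted text includes the leading "@", and the sample handles below are made up to stand in for the friends set built above.

```python
# Hypothetical result set, shaped like the `friends` set from the code above.
friends = {"@paulkrugman", "@nytimesphoto", "@amnesty_norge"}

# discard() removes the entry if it is present and does nothing otherwise,
# so it is safe even when the handle is missing from a given page.
friends.discard("@paulkrugman")

for friend in sorted(friends):
    print(friend)
```

Using discard() rather than remove() avoids a KeyError if the page layout changes and the handle is no longer there.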
