Unicode equal comparison fails when scraping a Twitter page with Python

0 votes

1 answer

738 views

Asked 2025-04-16 23:58

I am using the following code to get a list of a user's followers on Twitter:

import urllib
from BeautifulSoup import BeautifulSoup

#code only looks at one page of followers instead of continuing to all of a user's followers
#decided to only use a small sample 

site = "http://mobile.twitter.com/NYTimesKrugman/following"
friends = set()
response = urllib.urlopen(site)
html = response.read()
soup = BeautifulSoup(html)
names = soup.findAll('a', {'href': True})
for name in names:
    a = name.renderContents()
    b = a.lower()
    if ("http://mobile.twitter.com/" + b) == name['href']:
        c = str (b)
        friends.add(c)

for friend in friends:
    print friend
print ("Done!")

However, I get the following results:

NYTimeskrugman
nytimesphoto
rasermus

Warning (from warnings module):
   File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter     Crawler\crawlerversion14.py", line 42
    if ("http://mobile.twitter.com/" + b) == name['href']:
 UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
amnesty_norge
zynne_
fredssenteret
oljestudentene
solistkoret

.... (and so on)

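It looks like I am able to get most of the follower names, but I run into a warning that seems somewhat random. It does not stop the code from running, though... I was hoping someone could tell me what is happening?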

1 Answer

I don't know whether my answer is still useful a few years later, but I rewrote your code to use the requests library instead of urllib.
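As background on the warning in the question: in Python 2, that UnicodeWarning is raised when a byte string containing non-ASCII bytes is compared with a unicode string, because the implicit ASCII decoding fails and the two are treated as unequal. In the original code this most likely happens because renderContents() in BeautifulSoup 3 returns encoded bytes while name['href'] is unicode. Below is a minimal Python 3 sketch of decoding explicitly before comparing. The function name matches_profile_link and the sample values are hypothetical, and UTF-8 is an assumed encoding.

```python
BASE = "http://mobile.twitter.com/"


def matches_profile_link(raw_text: bytes, href: str, base: str = BASE) -> bool:
    """Return True if the link text, once decoded, spells the profile URL in href.

    raw_text stands in for the bytes returned by renderContents();
    href stands in for the already-decoded attribute value.
    """
    # Decode first so that text is compared with text, never bytes with text.
    # UTF-8 is an assumption; undecodable bytes are replaced, not raised.
    text = raw_text.decode("utf-8", errors="replace").lower()
    return base + text == href


# Hypothetical sample values for illustration
print(matches_profile_link(b"zynne_", BASE + "zynne_"))
print(matches_profile_link("Bjørn".encode("utf-8"), BASE + "zynne_"))
```

With requests, response.text is already decoded text, so the byte-versus-text mismatch does not come up in the rewritten code.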

I also think it is better to select the elements with the class "username", so that only the follower names are considered!

Here is the modified code:

import requests
from bs4 import BeautifulSoup

site = "http://mobile.twitter.com/paulkrugman/followers"
friends = set()
response = requests.get(site)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, "html.parser")
# Keep only anchors that carry an href attribute
names = soup.find_all('a', {'href': True})
for name in names:
    # Follower names sit in a <span class="username"> inside the link
    pseudo = name.find("span", {"class": "username"})
    if pseudo:
        friends.add(pseudo.get_text())

for friend in friends:
    print(friend)
print("Done!")

Note that @paulkrugman shows up in every set of results, so don't forget to remove it!
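One way to do that removal, as a rough sketch: the helper name drop_own_handle is hypothetical, and it assumes the scraped handles may differ from the account name in case, surrounding whitespace, or a leading "@". The sample set below is illustrative only.

```python
def drop_own_handle(handles, own_handle):
    """Return a new set without own_handle, ignoring case and a leading '@'."""

    def normalize(handle):
        # Strip whitespace and the leading '@' so both forms compare equal
        return handle.strip().lstrip("@").lower()

    target = normalize(own_handle)
    return {h for h in handles if normalize(h) != target}


# Illustrative data, not real scraped output
friends = {"@paulkrugman", "@zynne_", "@amnesty_norge"}
friends = drop_own_handle(friends, "paulkrugman")
print(sorted(friends))
```

If the handles are known to match exactly, friends.discard("@paulkrugman") on the set does the same job in one line.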

Answered 2025-04-16 by Python大师
