Completing a URL with Tweepy when expanded_url is not enough (integrate with urllib2?)

import tweepy
import codecs

auth = tweepy.OAuthHandler("xxx", "xxx")
auth.set_access_token("yyy", "yyy")

with codecs.open("file.txt", encoding='utf-8', mode='w+') as f:
    api = tweepy.API(auth)
    for status in tweepy.Cursor(api.user_timeline, "xxx", include_entities=True).items():
        ...
        # Extracting info from the entities
        for hashtag in status.entities['hashtags']:
            f.write(format(hashtag['text']))
        for url in status.entities['urls']:
            f.write(format(url['expanded_url']))

2 answers

User

Answer #1 · edited 2024-04-18 05:21:25

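Thanks to swstephe, I focused on HEAD requests, which avoid downloading the page itself, and I found the requests module very easy to use.

So I ended up with this solution: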

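import requests

for url in status.entities['urls']:
    expanded_url = url['expanded_url']
    # HEAD fetches only the headers, so the page body is never downloaded
    r = requests.head(expanded_url)
    if r.status_code in range(200, 300):
        # success: the URL is already the final one
        f.write(format(r.url))
    elif r.status_code in range(300, 400):
        # redirect: the target is given in the Location header
        f.write(format(r.headers['location']))
    else:
        # anything else: record the status code instead of a URL
        f.write(format(r.status_code))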

I still don't understand why urllib2 didn't work. I think I'll use requests from now on. Thanks for your help, I really appreciate it.
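The loop above only looks one hop ahead, so a shortener that points at another shortener would still leave you with an intermediate URL. Here is a minimal sketch of one way around that, assuming Python 3 and a recent requests release: passing allow_redirects=True makes requests.head() follow the whole chain (it does not by default), and r.url then holds the last URL reached. The names resolve_url and resolve_all and the 5 second timeout are my own choices for illustration, not something from the question.

```python
import requests


def resolve_url(url, timeout=5.0):
    """Return the final URL after following redirects, using HEAD requests only."""
    # requests.head() does not follow redirects unless asked to explicitly.
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return response.url


def resolve_all(urls, timeout=5.0):
    """Map each URL to its resolved form, or to a short error note on failure."""
    resolved = {}
    for url in urls:
        try:
            resolved[url] = resolve_url(url, timeout=timeout)
        except requests.RequestException as exc:
            # Hypothetical error format: keep the exception class name only.
            resolved[url] = "error: {}".format(type(exc).__name__)
    return resolved
```

In the Tweepy loop you would then write resolve_url(url['expanded_url']) to the file instead of branching on the status code yourself. Some servers answer HEAD differently from GET, so treat the result as a best effort.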

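User

Answer #2 · edited 2024-04-18 05:21:25

That URL comes from Google, so I don't think Tweepy stores the address Google would redirect you to if you clicked the link. You can find that out with httplib (this way you send only a HEAD request, instead of doing a full GET of the page it would load):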

import httplib
from urlparse import urlparse

url = urlparse('http://goo.gl/sOH17n')    # split URL into components
conn = httplib.HTTPConnection(url.hostname, url.port)
conn.request('HEAD', url.path)            # just look at the headers
rsp = conn.getresponse()
if rsp.status in (301, 302):              # resource moved (permanent|temporary)
    print rsp.getheader('location')
else:
    print url.geturl()                    # no redirect: print the original URL
conn.close()

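When I run it, I get a URL, not a 403 error. That error usually means you don't have permission to view the page, so I'm guessing the URL you are requesting is not the one you posted.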
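For anyone on Python 3, where httplib and urlparse became http.client and urllib.parse, a rough equivalent might look like the sketch below. The function name head_location, the timeout value, and the list of redirect codes are assumptions of mine for illustration. It performs a single HEAD request and returns the Location target if the server redirects, otherwise the URL it was given.

```python
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urljoin, urlparse

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def head_location(url, timeout=5.0):
    """Send one HEAD request; return the redirect target, or the URL itself."""
    parts = urlparse(url)
    conn_class = HTTPSConnection if parts.scheme == "https" else HTTPConnection
    conn = conn_class(parts.hostname, parts.port, timeout=timeout)
    try:
        # An empty path is not a valid request target, so fall back to "/".
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)  # headers only, no body is transferred
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the request URL.
            return urljoin(url, location)
        return url
    finally:
        conn.close()
```

Calling head_location('http://goo.gl/sOH17n') would return whatever the shortener currently redirects to. It follows one hop only, so call it again on the result if you need the end of a longer chain.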
