Managing Tweepy API search

35 votes

5 answers

92960 views

数据工程师

Asked 2025-04-17 22:45

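Forgive me if this has already been answered elsewhere, but I'm confused about how to use Tweepy's API search function. Is there any documentation on how to search for tweets with api.search()?

Can I control things such as the number of tweets returned, the result type, and so on?

For some reason the results seem to be capped at 100.

The snippet I'm using is: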

searched_tweets = self.api.search(q=query,rpp=100,count=1000)

5 Answers

You can search for tweets that match a specific string, for example:

tweets = api.search(q='Artificial Intelligence', count=100)  # count is capped at 100 per request

Answered 2025-04-17 by Python大师

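I am extracting Twitter data from around a particular location (India in this case), looking for all tweets that contain a specific keyword or any of a list of keywords.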

import tweepy
import credentials  # module holding my Twitter API credentials; keep it in the same directory as this script

# Set up the API connection
auth = tweepy.OAuthHandler(credentials.consumer_key,
                           credentials.consumer_secret)
auth.set_access_token(credentials.access_token,
                      credentials.access_secret)

# wait_on_rate_limit=True makes Tweepy wait when a rate limit is hit,
# since Twitter may block you if you exceed its limits
api = tweepy.API(auth, wait_on_rate_limit=True)

search_words = ["#covid19", "2020", "lockdown"]
# q must be a single string; OR matches tweets containing any of the keywords
query = " OR ".join(search_words)

date_since = "2020-05-21"

# The geocode is for India; the format is "latitude,longitude,radius",
# with the radius given in miles (mi) or kilometres (km)
tweets = tweepy.Cursor(api.search, q=query,
                       geocode="20.5937,78.9629,3000km",
                       lang="en", since=date_since).items(10)

for tweet in tweets:
    print("created_at: {}\nuser: {}\ntweet text: {}\ngeo_location: {}".
          format(tweet.created_at, tweet.user.screen_name, tweet.text, tweet.user.location))
    print("\n")
# tweet.user.location gives the user's general location, not the location of the tweet itself; most users do not share the exact location of a tweet

Output:

created_at: 2020-05-28 16:48:23
user: XXXXXXXXX
tweet text: RT @Eatala_Rajender: Media Bulletin on status of positive cases #COVID19 in Telangana. (Dated. 28.05.2020)
# TelanganaFightsCorona 
# StayHom…
geo_location: Hyderabad, India
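
The geocode string used in the code above has the form "latitude,longitude,radius", with the radius suffixed by mi or km. Below is a minimal sketch of a helper that builds and validates such a string. The name make_geocode is hypothetical and is not part of Tweepy.

```python
def make_geocode(lat, lon, radius, unit="km"):
    """Build a 'latitude,longitude,radius' string for the geocode parameter.

    make_geocode is a hypothetical helper, not a Tweepy function.
    """
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= lon <= 180:
        raise ValueError("longitude must be between -180 and 180")
    if radius <= 0:
        raise ValueError("radius must be positive")
    if unit not in ("km", "mi"):
        raise ValueError("unit must be 'km' or 'mi'")
    return f"{lat},{lon},{radius}{unit}"


# Same centre point and radius as the India example
india = make_geocode(20.5937, 78.9629, 3000)
print(india)  # 20.5937,78.9629,3000km
```

The result can be passed as geocode=india in the search call, which keeps malformed coordinates from reaching the API.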

Answered 2025-04-17 by Python大师

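The other answers are quite old, and the API has changed a lot since then.

The easy way is to use a Cursor (see the Cursor tutorial). pages() returns lists of elements, one list per page, and you can limit how many pages are returned: .pages(5) returns only 5 pages.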

for page in tweepy.Cursor(api.search, q='python', count=100, tweet_mode='extended').pages():
    # process status here
    process_page(page)

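Here q is the query, count is how many results each request should bring back (100 is the maximum), and tweet_mode='extended' returns the full text; without it, the text is truncated to 140 characters. More information can be found here. Retweets are still truncated, as confirmed by jaycech3n.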
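
With tweet_mode='extended', the untruncated text is in the full_text attribute rather than text, and for a retweet the complete original text sits on retweeted_status. Below is a small sketch of a helper that picks the right field. The name get_full_text is hypothetical, and the stand-in objects only imitate the relevant attributes of a Tweepy Status so the example runs without credentials.

```python
from types import SimpleNamespace


def get_full_text(status):
    """Return the most complete text available for a status object.

    get_full_text is a hypothetical helper, not part of Tweepy.
    """
    # For a retweet the top-level text is cut off, so use the original tweet.
    original = getattr(status, "retweeted_status", None)
    if original is not None:
        return get_full_text(original)
    # Extended mode exposes full_text; otherwise only text is present.
    full = getattr(status, "full_text", None)
    return full if full is not None else status.text


# Stand-in objects that imitate the relevant Status attributes.
plain = SimpleNamespace(text="short tweet")
extended = SimpleNamespace(full_text="a long tweet, not cut off")
retweet = SimpleNamespace(full_text="RT @someone: a long tw…",
                          retweeted_status=extended)

for s in (plain, extended, retweet):
    print(get_full_text(s))
```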

If you don't want to use tweepy.Cursor, you need to pass max_id yourself to fetch the next chunk of results. See here for more information.

last_id = None
while True:
    result = api.search(q='python', count=100, tweet_mode='extended', max_id=last_id)
    if not result:
        # an empty page means there is nothing older left to fetch
        break
    process_result(result)
    # subtract one so the same tweet is not returned again
    last_id = result[-1].id - 1
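
The max_id loop can also be written as a generator that takes the search function as an argument, so the paging logic can be exercised without network access. This is only a sketch: iter_search and fake_search are hypothetical names, and with Tweepy you would pass api.search (renamed api.search_tweets in Tweepy 4) in place of the fake.

```python
from types import SimpleNamespace


def iter_search(search, max_tweets, page_size=100, **params):
    """Yield up to max_tweets results, paging backwards with max_id.

    `search` is any callable that accepts count= and max_id= keyword
    arguments and returns a list of objects with an integer `id`,
    newest first.
    """
    yielded = 0
    max_id = None
    while yielded < max_tweets:
        count = min(page_size, max_tweets - yielded)
        page = search(count=count, max_id=max_id, **params)
        if not page:
            break  # no older results are available
        for status in page:
            yield status
            yielded += 1
        # Ask only for tweets strictly older than the last one seen.
        max_id = page[-1].id - 1


def fake_search(q, count, max_id=None):
    """Hypothetical stand-in for the API: 250 tweets with ids 250..1."""
    newest = 250 if max_id is None else min(max_id, 250)
    oldest = max(newest - count + 1, 1)
    return [SimpleNamespace(id=i) for i in range(newest, oldest - 1, -1)]


ids = [s.id for s in iter_search(fake_search, max_tweets=1000, q="python")]
print(len(ids), ids[0], ids[-1])  # 250 250 1
```

Because the generator yields tweets one at a time, the caller decides whether to keep them all in a list or process and discard them as they arrive.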

Answered 2025-04-17 by Python大师

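There is a problem in your code. According to the Twitter documentation for GET search/tweets, the count parameter is:

The number of tweets to return per page, up to a maximum of 100. Defaults to 15. This was formerly the "rpp" parameter in the old Search API.

Your code should look like this: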

import tweepy

CONSUMER_KEY = '....'
CONSUMER_SECRET = '....'
ACCESS_KEY = '....'
ACCESS_SECRET = '....'

auth = tweepy.auth.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth)
search_results = api.search(q="hello", count=100)

for i in search_results:
    # Do whatever you need here, for example print the tweet text
    print(i.text)
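
Because each request returns at most 100 tweets, collecting 1000 takes at least ten requests. Below is a sketch of that arithmetic. The name plan_requests is a hypothetical helper, and the default limit of 100 is the one quoted from the documentation above.

```python
def plan_requests(total, per_request_limit=100):
    """Return the count to ask for in each request to collect `total` tweets."""
    if total < 0 or per_request_limit <= 0:
        raise ValueError("total must be >= 0 and the limit must be positive")
    full, remainder = divmod(total, per_request_limit)
    return [per_request_limit] * full + ([remainder] if remainder else [])


print(len(plan_requests(1000)))  # 10
print(plan_requests(250))        # [100, 100, 50]
```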

Answered 2025-04-17 by Python大师

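I originally solved this by following Yuva Raj's suggestion to use additional parameters of GET search/tweets: the max_id parameter, combined with the id of the last tweet returned in each iteration of a loop that also checks for a TweepError.

However, I found a much simpler way to solve the problem, which is to use tweepy.Cursor (see the Tweepy Cursor tutorial for more on using Cursor).

The following code fetches the 1000 most recent mentions of 'python'.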

import tweepy
# assuming twitter_authentication.py contains each of the 4 oauth elements (1 per line)
from twitter_authentication import API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET

auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

query = 'python'
max_tweets = 1000
searched_tweets = [status for status in tweepy.Cursor(api.search, q=query).items(max_tweets)]

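Update: in response to Andre Petre's comment that tweepy.Cursor may consume a lot of memory, I'm including my original solution as well. It replaces the single-statement list comprehension used above to compute searched_tweets with the following: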

searched_tweets = []
last_id = -1
while len(searched_tweets) < max_tweets:
    count = max_tweets - len(searched_tweets)
    try:
        new_tweets = api.search(q=query, count=count, max_id=str(last_id - 1))
        if not new_tweets:
            break
        searched_tweets.extend(new_tweets)
        last_id = new_tweets[-1].id
    except tweepy.TweepError as e:
        # depending on TweepError.code, one may want to retry or wait
        # to keep things simple, we will give up on an error
        break
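
The comment in the loop above mentions retrying or waiting instead of giving up. Below is a minimal sketch of retrying with exponential backoff. The name call_with_retry is a hypothetical helper, the exception type is passed in so that tweepy.TweepError (or tweepy.TweepyException in Tweepy 4) can be supplied, and the delays are illustrative rather than tuned to Twitter's rate limits.

```python
import time


def call_with_retry(func, retry_on, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a `retry_on` exception, wait and try again.

    The wait doubles after each failure. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, RuntimeError, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

In the loop above, the api.search call could be wrapped as call_with_retry(lambda: api.search(q=query, count=count, max_id=str(last_id - 1)), tweepy.TweepError), leaving the except branch to handle only the final failure.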

Answered 2025-04-17 by Python大师
