Extract all comments from the Yahoo Finance community forum
I am scraping the comments and replies for a particular stock (e.g. TSLA) on Yahoo Finance with Python's Selenium library. Extracting every comment together with its replies is difficult, because Yahoo Finance requires user interaction to reveal the replies under each comment, and the comments have no unique identifiers. On top of that, handling deleted comments complicates things further.
Here is the approach I am using at the moment.
import requests
import json

# Prepare the payload for the API request using the updated 'spotId' and 'uuid'
api_url = "https://api-2-0.spot.im/v1.0.0/conversation/read"
payload = json.dumps({
    "conversation_id": "sp_Rba9aFpG_finmb$27444752",  # Updated to match the desired format
    "count": 250,
    "offset": 0,
    "sort_by": "newest"  # Assuming you want to sort by the newest; adjust as needed
})

api_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0',
    'Content-Type': 'application/json',
    'x-spot-id': "sp_Rba9aFpG",  # Spot ID as per your configuration
    'x-post-id': "finmb$27444752",  # Post ID updated to reflect the desired conversation
    # Include any other necessary headers as per the API documentation or your requirements
}

# Make the API request to fetch the conversation data
response = requests.post(api_url, headers=api_headers, data=payload)

# Parse the JSON response and print it
data = response.json()
print(json.dumps(data, indent=4))  # Print the response data formatted for readability
1 Answer
You can keep calling this API in a loop until all the comments have been fetched.
import requests
from pprint import pprint
import json

# Prepare the payload for the API request using the updated 'spotId' and 'uuid'
api_url = "https://api-2-0.spot.im/v1.0.0/conversation/read"
payload = {
    "conversation_id": "sp_Rba9aFpG_finmb$27444752",
    "count": 25,
    "offset": 0,
    "sort_by": "newest",
}
api_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0",
    "Content-Type": "application/json",
    "x-spot-id": "sp_Rba9aFpG",  # Spot ID as per your configuration
    "x-post-id": "finmb$27444752",  # Post ID updated to reflect the desired conversation
    # Include any other necessary headers as per the API documentation or your requirements
}
# Page through the conversation until "has_next" reports there are no more
# pages; the comments on the final page are collected before the loop exits.
comments = []
while True:
    response = requests.post(api_url, headers=api_headers, data=json.dumps(payload))
    data = response.json()
    comm = data["conversation"]["comments"]
    comments.extend(comm)
    pprint(comm)
    if not data["conversation"]["has_next"]:
        break
    payload["offset"] = data["conversation"]["offset"]
pprint(comments)
That is the code, but for a conversation like this one with roughly 900,000 comments it is nearly impossible to get all the way through without hitting an error. So I suggest you store the comments in a database such as SQLite and create checkpoints (recording the last offset), so that if anything goes wrong you can resume from where you stopped, for example:
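A minimal sketch of that idea, assuming the response shape used in the loop above ("comments", "offset" and "has_next" under "conversation"); the database file name, table layout and page size are only illustrative:

import json
import sqlite3

import requests

API_URL = "https://api-2-0.spot.im/v1.0.0/conversation/read"
API_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0",
    "Content-Type": "application/json",
    "x-spot-id": "sp_Rba9aFpG",
    "x-post-id": "finmb$27444752",
}

db = sqlite3.connect("tsla_comments.db")
db.execute("CREATE TABLE IF NOT EXISTS comments (raw TEXT)")               # raw comment JSON
db.execute("CREATE TABLE IF NOT EXISTS checkpoint (last_offset INTEGER)")  # resume point

# Resume from the last saved offset, or start at 0 on the first run.
row = db.execute("SELECT last_offset FROM checkpoint").fetchone()
payload = {
    "conversation_id": "sp_Rba9aFpG_finmb$27444752",
    "count": 25,
    "offset": row[0] if row else 0,
    "sort_by": "newest",
}

while True:
    response = requests.post(API_URL, headers=API_HEADERS, data=json.dumps(payload))
    conversation = response.json()["conversation"]

    # Save this page and the new offset in one transaction, so a crash never
    # leaves the checkpoint ahead of the comments actually stored.
    with db:
        db.executemany(
            "INSERT INTO comments (raw) VALUES (?)",
            [(json.dumps(c),) for c in conversation["comments"]],
        )
        db.execute("DELETE FROM checkpoint")
        db.execute("INSERT INTO checkpoint (last_offset) VALUES (?)", (conversation["offset"],))

    if not conversation["has_next"]:
        break
    payload["offset"] = conversation["offset"]

db.close()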
Another problem is that Yahoo may ban you from the API.
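To lower that risk you can slow the requests down and back off when the API starts refusing them. This is only a sketch under stated assumptions: the 403/429 status codes and the delays are guesses (the rate limits are not documented), and fetch_page is a hypothetical helper you would call in place of the plain requests.post inside the loop above.

import random
import time

import requests

def fetch_page(api_url, headers, payload, max_retries=5):
    # POST one page of the conversation, pausing between requests so the API
    # is not hammered; retries with exponential backoff when it looks throttled.
    for attempt in range(max_retries):
        response = requests.post(api_url, headers=headers, json=payload, timeout=30)
        if response.status_code == 200:
            time.sleep(1 + random.random())  # small polite pause between pages
            return response.json()
        if response.status_code in (403, 429):
            time.sleep(2 ** attempt * 10)    # assumed throttling: back off and retry
            continue
        response.raise_for_status()          # anything else is a real error
    raise RuntimeError("Giving up after repeated rate-limit responses")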