Extracting all comments from a Yahoo Finance community forum

-1 votes
1 answer
35 views
Asked 2025-04-12 19:18

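I am using Python's Selenium library to scrape the comments and replies for a specific stock (for example TSLA) on Yahoo Finance. Extracting every comment together with its replies is difficult, because Yahoo Finance requires user interaction to reveal the replies under each comment, and the comments have no unique identifiers. Handling deleted comments complicates things further.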

This is the approach I am currently using.

import requests
import json

# Prepare the payload for the API request using the updated 'spotId' and 'uuid'
api_url = "https://api-2-0.spot.im/v1.0.0/conversation/read"
payload = json.dumps({
  "conversation_id": "sp_Rba9aFpG_finmb$27444752",  # Updated to match the desired format
  "count": 250,
  "offset": 0,
  "sort_by": "newest"  # Assuming you want to sort by the newest; adjust as needed
})

api_headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0',
  'Content-Type': 'application/json',
  'x-spot-id': "sp_Rba9aFpG",  # Spot ID as per your configuration
  'x-post-id': "finmb$27444752",  # Post ID updated to reflect the desired conversation
  # Include any other necessary headers as per the API documentation or your requirements
}

# Make the API request to fetch the conversation data
response = requests.post(api_url, headers=api_headers, data=payload)

# Parse the JSON response and print it
data = response.json()
print(json.dumps(data, indent=4))  # Print the response data formatted for readability

1 Answer

0

You can keep calling this API in a loop until all of the comments have been retrieved.

import requests
from pprint import pprint
import json

# Prepare the payload for the API request using the updated 'spotId' and 'uuid'
api_url = "https://api-2-0.spot.im/v1.0.0/conversation/read"
payload = {
    "conversation_id": "sp_Rba9aFpG_finmb$27444752",
    "count": 25,
    "offset": 0,
    "sort_by": "newest",
}


api_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0",
    "Content-Type": "application/json",
    "x-spot-id": "sp_Rba9aFpG",  # Spot ID as per your configuration
    "x-post-id": "finmb$27444752",  # Post ID updated to reflect the desired conversation
    # Include any other necessary headers as per the API documentation or your requirements
}

# Fetch page after page until the API reports that no more comments follow
comments = []
while True:
    response = requests.post(api_url, headers=api_headers, data=json.dumps(payload))
    response.raise_for_status()  # Fail loudly on an HTTP error instead of parsing an error body
    data = response.json()

    comm = data["conversation"]["comments"]
    comments.extend(comm)
    pprint(comm)

    # Check has_next only after storing the page, so the last page is not dropped
    if not data["conversation"]["has_next"]:
        break
    payload["offset"] = data["conversation"]["offset"]

pprint(comments)

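That is the code, but for a conversation like this one, with roughly 900,000 comments, it is almost impossible for the loop to run to completion without hitting an error.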

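So I suggest storing the comments in a database such as SQLite and creating checkpoints (recording the last offset), so that if anything goes wrong you can resume from where you left off.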
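A minimal sketch of that checkpointing idea, using only the standard library. The table names, the helper functions, and the `fetch_page` callable are all hypothetical: `fetch_page(offset)` is assumed to wrap the `requests.post` call above and return the parsed JSON dict. Each page and the offset for the next page are written in the same transaction, so a crash should never leave the checkpoint ahead of the stored data.

```python
import json
import sqlite3


def open_store(path):
    """Open (or create) the SQLite file that holds comments and the checkpoint."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    # A single-row table: the CHECK constraint keeps it to one checkpoint.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), next_offset INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def load_offset(conn):
    """Return the offset to resume from, or 0 on a fresh database."""
    row = conn.execute("SELECT next_offset FROM checkpoint WHERE id = 1").fetchone()
    return row[0] if row else 0


def save_page(conn, page_comments, next_offset):
    """Store one page and advance the checkpoint in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO comments (body) VALUES (?)",
            [(json.dumps(c),) for c in page_comments],
        )
        conn.execute(
            "INSERT OR REPLACE INTO checkpoint (id, next_offset) VALUES (1, ?)",
            (next_offset,),
        )


def crawl(conn, fetch_page):
    """Fetch pages from the saved offset until has_next is false.

    fetch_page is a hypothetical callable: it takes an offset and returns
    the parsed JSON dict of one API response.
    """
    offset = load_offset(conn)
    while True:
        conversation = fetch_page(offset)["conversation"]
        offset = conversation["offset"]
        save_page(conn, conversation["comments"], offset)
        if not conversation["has_next"]:
            break
```

After a failure, calling `crawl(conn, fetch_page)` again on the same database file picks up at the last committed offset instead of starting from zero.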

Another problem is that Yahoo may ban you from the API.
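One common way to lower that risk, though not a guarantee against a ban, is to slow down and retry when the server pushes back. The sketch below assumes the API signals throttling with HTTP 429 or a 5xx status, which is not confirmed for this endpoint. `fetch_with_backoff` and its `send` parameter are hypothetical names; `send` stands for any zero-argument callable that performs one request and returns an object with a `status_code` attribute.

```python
import time


def fetch_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a response that is not HTTP 429 or 5xx.

    send is a hypothetical zero-argument callable that performs one request,
    for example:
    lambda: requests.post(api_url, headers=api_headers, data=json.dumps(payload))
    """
    for attempt in range(max_retries + 1):
        response = send()
        retryable = response.status_code == 429 or response.status_code >= 500
        if not retryable:
            return response
        if attempt < max_retries:
            # Wait base_delay, then 2x, 4x, ... before trying again
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_retries + 1} attempts")
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.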
