Combining text and numeric columns for an ML algorithm

| tweet_id | airline_sentiment_confidence | negativereason | negativereason_confidence | airline name | retweet_count | text | tweet_created | tweet location | user_timezone | airline_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| Tr_tweet_1 | 1.000 | NaN | NaN | Virgin America | 0 | tweets | date | Location | Time | Positive |
| Tr_tweet_2 | 0.3846 | NaN | 0.7033 | Virgin America | 0 | tweets | date | Location | Time | Negative |
| Tr_tweet_3 | 0.6837 | Bad flight | 0.3342 | Virgin America | 0 | tweets | date | Location | Time | Negative |
| Tr_tweet_4 | 1.000 | Can't tell | 1.000 | Virgin America | 0 | tweets | date | Location | Time | Neutral |
| Tr_tweet_5 | 1.000 | NaN | NaN | Virgin America | 0 | tweets | date | Location | Time | Neutral |

1 answer

Community user

#1 · Posted 2024-05-16 01:54:04

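One approach, as you mentioned, is stacking. You can represent each tweet as a feature vector in which each position corresponds to a word/term and the value is that term's tf-idf weight. Then concatenate each tweet's tf-idf vector with its remaining numeric columns, and stack those row vectors to get a matrix. Once you have the matrix, you can start trying different machine learning models.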
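A minimal sketch of that stacking step, assuming pandas, SciPy and scikit-learn are available. The tweet texts are made up for illustration (the table above only shows placeholders), and the choice of `airline_sentiment_confidence` and `retweet_count` as the numeric columns is an assumption; columns containing NaN, such as `negativereason_confidence`, would need imputing before they could be added the same way.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; column names follow the table in the question,
# the tweet texts are invented for this example.
df = pd.DataFrame({
    "text": [
        "great flight and friendly crew",
        "my flight was delayed again",
        "bad flight, lost my luggage",
        "checking in for my flight now",
        "what time does boarding start",
    ],
    "airline_sentiment_confidence": [1.0, 0.3846, 0.6837, 1.0, 1.0],
    "retweet_count": [0, 0, 0, 0, 0],
    "airline_sentiment": ["Positive", "Negative", "Negative", "Neutral", "Neutral"],
})

# One tf-idf row vector per tweet: sparse matrix of shape (n_tweets, n_terms).
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(df["text"])

# Numeric columns converted to a sparse matrix so they can be concatenated.
numeric_cols = ["airline_sentiment_confidence", "retweet_count"]
X_num = csr_matrix(df[numeric_cols].to_numpy(dtype=float))

# Column-wise concatenation: each row is [tf-idf vector | numeric features].
X = hstack([X_text, X_num]).tocsr()
y = df["airline_sentiment"]

print(X.shape)
```

Keeping the result sparse avoids materializing the full term matrix, and `X` and `y` can then be passed to any scikit-learn estimator that accepts sparse input.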

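Note that once you have a tf-idf vector for each tweet, it may make sense to run a dimensionality reduction algorithm such as PCA, since you will be dealing with large, sparse vectors. Depending on your data, it may also make sense to normalize each concatenated vector (for example, scaling all values to the 0-1 range). Finally, the text of a single tweet is often not informative enough on its own; you may want to consider aggregating similar tweets together to get better results.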
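The reduction and normalization steps could look roughly like the sketch below. It uses scikit-learn's `TruncatedSVD` in place of PCA, because it works directly on sparse tf-idf matrices without centering them; that substitution, the two components, the invented tweet texts and the made-up numeric values are all assumptions for illustration. Normalization is interpreted here as per-column min-max scaling with `MinMaxScaler`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical tweets and numeric features (confidence, retweet count).
texts = [
    "great flight and friendly crew",
    "my flight was delayed again",
    "bad flight, lost my luggage",
    "checking in for my flight now",
    "what time does boarding start",
]
numeric = np.array([
    [1.0, 0],
    [0.3846, 3],
    [0.6837, 1],
    [1.0, 0],
    [1.0, 12],
])

# Sparse tf-idf matrix, one row per tweet.
X_text = TfidfVectorizer().fit_transform(texts)

# Reduce the sparse term space to a small dense representation.
# n_components=2 is only suitable for this toy example.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_text)

# Concatenate the reduced text features with the numeric columns,
# then rescale every column to the 0-1 range.
X = np.hstack([X_reduced, numeric])
X_scaled = MinMaxScaler().fit_transform(X)

print(X_scaled.shape)
```

On real data the number of components is usually much larger and worth tuning, and the scaler should be fitted on the training split only.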
