Scikit-learn in Python: predicting categories from text-based features

Posted 2024-04-28 16:26:42


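I want to predict which users find a review helpful (or "blank" if nobody has found it helpful so far). Either: 1) predict the string of users (assuming the names are always in alphabetical order); or 2) for each user, predict whether they will find the review helpful. For now the number of users is limited (fewer than 10), so code written for that case is acceptable. It would still be interesting to consider a future application that predicts for many more users (say thousands or millions of possible users).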
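The two formulations carry the same information, which a small sketch can make concrete. It assumes the label cells are comma-separated names as in the sample data; `parse_users` and `to_string` are hypothetical helper names, not part of any library.

```python
# Label cells in the same comma-separated form as the sample data.
helpful = ["Bill", "Jane", "Bill,Jane,Wanda", "", ""]


def parse_users(cell):
    """Split a comma-separated cell into a set of user names ("" gives an empty set)."""
    return {name.strip() for name in cell.split(",") if name.strip()}


label_sets = [parse_users(cell) for cell in helpful]

# Alphabetical list of every user seen in the labels.
users = sorted(set().union(*label_sets))

# Formulation 2: one 0/1 flag per user and per review.
indicator = [[int(user in labels) for user in users] for labels in label_sets]


def to_string(row):
    """Formulation 1: rebuild the alphabetically ordered string from a 0/1 row."""
    return ",".join(user for user, flag in zip(users, row) if flag)


rebuilt = [to_string(row) for row in indicator]
```

Because the string can be rebuilt from the per-user flags, a model for formulation 2 also answers formulation 1. With thousands or millions of users the dense list of lists would probably need to become a sparse matrix.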

Sample data: train.csv

"id","title","review","user tags","user(s) who find review helpful"
"123","All movies!","I really love movies","love,all","Bill"
"456","No movies!","I really hate movies","hate,none","Jane"
"789","Great show!","That show was really great","great,really","Bill,Jane,Wanda"
"899","Interesting plot!","He makes the plot interesting","interesting,plot",""
"999","So tired!","The ending made me sleep","ending,tired,sleepy",""

Test: use the text in columns 1, 2, and 3 to predict the text in column 4. Ignore the numeric id in column 0.
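One possible way to set this task up is as multi-label classification. The sketch below is not the code from the question: it assumes the five sample rows are the whole training set, joins columns 1 to 3 into a single text field, and swaps in `TfidfVectorizer` with a one-vs-rest `LogisticRegression`, a choice made here for illustration only.

```python
import csv
import io

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# The sample rows from train.csv, embedded so the sketch is self-contained.
SAMPLE = '''"id","title","review","user tags","user(s) who find review helpful"
"123","All movies!","I really love movies","love,all","Bill"
"456","No movies!","I really hate movies","hate,none","Jane"
"789","Great show!","That show was really great","great,really","Bill,Jane,Wanda"
"899","Interesting plot!","He makes the plot interesting","interesting,plot",""
"999","So tired!","The ending made me sleep","ending,tired,sleepy",""
'''

rows = list(csv.reader(io.StringIO(SAMPLE)))[1:]  # skip the header row

# Columns 1-3 (title, review, tags) become one text field; column 0 (id) is ignored.
texts = [" ".join(row[1:4]) for row in rows]

# Column 4 becomes a list of user names; an empty cell means "blank".
labels = [[user for user in row[4].split(",") if user] for row in rows]

# One 0/1 column per user.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary classifier per user on top of shared TF-IDF features.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(texts, Y)

# Predict for a new, made-up review and map the 0/1 row back to user names.
pred = clf.predict(["Great movies! I really love that show great,love"])
predicted_users = mlb.inverse_transform(pred)
```

`predicted_users` holds one tuple of names per input review, and an empty tuple corresponds to the "blank" case. Five rows are far too few to learn anything meaningful, so the sketch only shows the shape of the approach.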

So far I have been following the guide here (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

Current code:

^{pr2}$

This produces the following output:

---> 38 predicted = text_clf.predict(data.iloc[100001:101000,5].values)
AttributeError: 'numpy.int64' object has no attribute 'lower'
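This message usually means a text vectorizer received integers instead of strings, because it lowercases each document by calling `.lower()` on it. The call in the traceback selects column index 5, while the sample file above only has columns 0 to 4, so the selected column may be numeric in the real data; that is a guess, since the full data is not shown. A minimal sketch that reproduces the error on a made-up numeric array and shows one workaround:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for a numeric column selected by mistake.
numeric_column = np.array([123, 456, 789])

# The vectorizer lowercases each document, which fails on integers.
try:
    CountVectorizer().fit(numeric_column)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Converting the values to strings avoids the error.
as_text = numeric_column.astype(str)
vectorizer = CountVectorizer().fit(as_text)
vocabulary = sorted(vectorizer.vocabulary_)
```

Casting to strings only silences the error. If the goal is to predict from text, the more useful fix is probably to check which column index actually holds the text and pass that column to both `fit` and `predict`.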
