Dense crosstab of the n most frequent values in column 1 and column 2

from fastai2.collab import *
from fastai2.tabular.all import *

path = untar_data(URLs.ML_100k)

# load the ratings from csv
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user','movie','rating','timestamp'])

# show a sample of the format
ratings.head(10)

# slice the most frequent n=20 users and movies
most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])
denser_ratings = ratings[ratings.user.isin(most_frequent_users)]
denser_movies = ratings[ratings.movie.isin(most_rated_movies)]

# crosstab the most frequent users and movies, showing the ratings
pd.crosstab(denser_ratings.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')

1 Answer

User

#1 · Posted 2024-05-23 16:34:29

My code has a bug which is making it index into the dataframe incorrectly for what I think I'm doing

Yes, there is a bug.

most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])

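is actually grabbing the value counts. So if users 1, 2, and 3 each made 100 reviews, the code above returns [100, 100, 100] when what we really want is the ids [1, 2, 3]. To get the ids of the most frequent entries instead of their counts, add .index to value_counts():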

most_frequent_users = list(ratings.user.value_counts().index[:20])
most_rated_movies = list(ratings.movie.value_counts().index[:20])
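
To make the difference concrete, here is a minimal sketch on a made-up toy frame (the data is invented for illustration and is not from MovieLens). It shows that slicing value_counts() yields the counts, while slicing its .index yields the ids:

```python
import pandas as pd

# Toy data, invented for illustration: user 7 rates 3 times, user 3 twice, user 9 once
ratings = pd.DataFrame({
    "user":   [7, 7, 7, 3, 3, 9],
    "rating": [5, 4, 3, 2, 5, 1],
})

# value_counts() returns a Series: the index holds the ids, the values hold the counts
counts = ratings.user.value_counts()

buggy_ids = counts[:2].tolist()        # the counts themselves
fixed_ids = counts.index[:2].tolist()  # the ids of the two most frequent users

print(buggy_ids)  # [3, 2]
print(fixed_ids)  # [7, 3]
```

Passing buggy_ids to isin() would select users whose id happens to equal a count, which is why the earlier result looked like an arbitrary sample.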

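This fix alone raises the density to about the level shown in the final result. What I was doing before was effectively just a random sample (mistakenly using the value counts as lookups for the movie ids).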

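In addition, the approach I mentioned at the end of my post is a more robust, general solution for a crosstab that targets the highest density: find the most frequent X, then find the most frequent Y within that specific set. This works well even on sparse datasets.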

n_users = 10
n_movies = 20

# list the ids of the most frequent users (those who rated the most movies)
most_frequent_users = list(ratings.user.value_counts().index[:n_users])
# grab all the ratings made by these most frequent users
denser_users = ratings[ratings.user.isin(most_frequent_users)]

# list the ids of the most frequent movies within this group of users
dense_users_most_rated = list(denser_users.movie.value_counts().index[:n_movies])
# grab all the most frequent movies rated by the most frequent users
denser_movies = ratings[ratings.movie.isin(dense_users_most_rated)]

# plot the crosstab
pd.crosstab(denser_users.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')
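
For anyone who wants to try the idea without downloading the dataset, here is a rough, self-contained sketch of the same two-step selection. The helper name dense_crosstab, its parameters, and the toy data are all hypothetical and not part of the original post. Unlike the snippet above, it filters the movies from the already-filtered user rows rather than relying on index alignment inside pd.crosstab:

```python
import pandas as pd

def dense_crosstab(df, row_col, col_col, value_col, n_rows, n_cols):
    """Hypothetical helper: crosstab of the n_rows most frequent row ids and,
    within those rows only, the n_cols most frequent column ids."""
    # ids of the most frequent rows (e.g. users with the most ratings)
    top_rows = df[row_col].value_counts().index[:n_rows]
    by_rows = df[df[row_col].isin(top_rows)]
    # ids of the most frequent columns *within* that subset
    top_cols = by_rows[col_col].value_counts().index[:n_cols]
    dense = by_rows[by_rows[col_col].isin(top_cols)]
    return pd.crosstab(dense[row_col], dense[col_col],
                       values=dense[value_col], aggfunc="mean")

# Toy data, invented for illustration
toy = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 2, 3, 4],
    "movie":  [10, 20, 30, 10, 20, 40, 50, 10],
    "rating": [5, 3, 4, 4, 2, 1, 3, 5],
})

table = dense_crosstab(toy, "user", "movie", "rating", n_rows=2, n_cols=2)
print(table)
```

On this toy frame, users 1 and 2 are the most frequent, and movies 10 and 20 are the most rated among them, so every cell of the 2x2 result is filled.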

This is exactly what I was after.

The only remaining questions are how standard this approach is, and why some of the values are floats.
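
On the floats, one likely explanation (my reading, not something confirmed in the thread): aggfunc='mean' returns floating-point averages, and empty cells are NaN, which is itself a float. The sketch below, on invented toy data, shows that the crosstab columns come back as float64 even when every rating is an integer:

```python
import pandas as pd

# Toy data, invented for illustration: user 2 never rated movie 20
toy = pd.DataFrame({
    "user":   [1, 1, 2],
    "movie":  [10, 20, 10],
    "rating": [5, 3, 4],
})

table = pd.crosstab(toy.user, toy.movie, values=toy.rating, aggfunc="mean")

print(table.dtypes)                 # both columns are float64
print(table.isna().sum().sum())     # number of empty (NaN) cells
```

Calling .fillna('-') afterwards only swaps the NaN cells for a string, so the remaining cells still display as floats such as 5.0.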
