Pandas DataFrame vectorization/filtering: ValueError: Can only compare identically-labeled Series objects

Posted 2024-06-08 18:06:59


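I have two NHL hockey DataFrames. One contains every game for every team over the past ten years; the other is the one I want to fill with computed values. Put simply, I want to take a metric from a team's first five games of a season, sum it, and store the result in the other DataFrame. I've trimmed my dfs below to drop the other stats and look at just one statistic.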

df_all contains all games:

>>> df_all
        season      gameId playerTeam opposingTeam  gameDate  xGoalsFor  xGoalsAgainst
1         2008  2008020001        NYR          T.B  20081004      2.287          2.689
6         2008  2008020003        NYR          T.B  20081005      1.793          0.916
11        2008  2008020010        NYR          CHI  20081010      1.938          2.762
16        2008  2008020019        NYR          PHI  20081011      3.030          3.020
21        2008  2008020034        NYR          N.J  20081013      1.562          3.454
...        ...         ...        ...          ...       ...        ...            ...
142576    2015  2015030185        L.A          S.J  20160422      2.927          2.042
142581    2017  2017030171        L.A          VGK  20180411      1.275          2.279
142586    2017  2017030172        L.A          VGK  20180413      1.907          4.642
142591    2017  2017030173        L.A          VGK  20180415      2.452          3.159
142596    2017  2017030174        L.A          VGK  20180417      2.427          1.818

df_sum_all will hold the computed statistics; for now it has a bunch of empty columns:

>>> df_sum_all
     season team  xg5  xg10  xg15  xg20
0      2008  NYR    0     0     0     0
1      2009  NYR    0     0     0     0
2      2010  NYR    0     0     0     0
3      2011  NYR    0     0     0     0
4      2012  NYR    0     0     0     0
..      ...  ...  ...   ...   ...   ...
327    2014  L.A    0     0     0     0
328    2015  L.A    0     0     0     0
329    2016  L.A    0     0     0     0
330    2017  L.A    0     0     0     0
331    2018  L.A    0     0     0     0

Here is the function I use to compute the ratio of xGoalsFor to xGoalsAgainst.

def calcRatio(statfor, statagainst, games, season, team, statsdf):
    tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
    tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statagainst).sum())
    tempRatio = tempFor / tempAgainst
    return tempRatio

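I believe the logic is sound. I pass in the two stats I want a ratio of, how many games to sum, the season and team to match, and the DataFrame to pull the data from. I've tested the pieces separately and know that the filtering and the summing work. Here is a standalone run of the tempFor calculation: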

>>> statsdf = df_all
>>> team = 'TOR'
>>> season = 2015
>>> games = 3
>>> tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
>>> print(tempFor)
8.618

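See? It returns a value. However, I can't do the same thing across the whole DataFrame. What am I missing? I assumed this would be applied to each row, setting the 'xg5' column to the output of calcRatio, with that row's 'season' and 'team' used to filter df_all.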

>>> df_sum_all['xg5'] = calcRatio('xGoalsFor','xGoalsAgainst',5,df_sum_all['season'], df_sum_all['team'], df_all)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in calcRatio
  File "/home/sebastian/.local/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 1142, in wrapper
    raise ValueError("Can only compare identically-labeled " "Series objects")
ValueError: Can only compare identically-labeled Series objects

Cheers, and thanks for any help!

Update: I used iterrows() and it works fine, so I must not understand vectorization very well. It does the same thing, though, so why does it work one way and not the other?

>>> emptyseries = []
>>> for index, row in df_sum_all.iterrows():
...     emptyseries.append(calcRatio('xGoalsFor','xGoalsAgainst',5,row['season'],row['team'], df_all))
... 
>>> df_sum_all['xg5'] = emptyseries
__main__:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df_sum_all
     season team       xg5  xg10  xg15  xg20
0      2008  NYR  0.826260     0     0     0
1      2009  NYR  1.288390     0     0     0
2      2010  NYR  0.915942     0     0     0
3      2011  NYR  0.730498     0     0     0
4      2012  NYR  0.980744     0     0     0
..      ...  ...       ...   ...   ...   ...
327    2014  L.A  0.823998     0     0     0
328    2015  L.A  1.147412     0     0     0
329    2016  L.A  1.054947     0     0     0
330    2017  L.A  1.369005     0     0     0
331    2018  L.A  0.721411     0     0     0

[332 rows x 6 columns]
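The loop above works because `row['season']` and `row['team']` are scalars on each iteration. As a rough sketch, the same result can be written with `DataFrame.apply(axis=1)`, or computed for every (season, team) group at once with `groupby`. The toy frames below reuse only the NYR 2008 and L.A 2017 rows visible in this question, so the L.A figure will not match the real table; `calc_ratio` and `xg5_vec` are names introduced here purely for illustration.

```python
import pandas as pd

# Toy data: only the rows shown in the question (not the full tables).
df_all = pd.DataFrame({
    "season": [2008] * 5 + [2017] * 4,
    "playerTeam": ["NYR"] * 5 + ["L.A"] * 4,
    "gameDate": [20081004, 20081005, 20081010, 20081011, 20081013,
                 20180411, 20180413, 20180415, 20180417],
    "xGoalsFor": [2.287, 1.793, 1.938, 3.030, 1.562,
                  1.275, 1.907, 2.452, 2.427],
    "xGoalsAgainst": [2.689, 0.916, 2.762, 3.020, 3.454,
                      2.279, 4.642, 3.159, 1.818],
})
df_sum_all = pd.DataFrame({"season": [2008, 2017], "team": ["NYR", "L.A"]})


def calc_ratio(stat_for, stat_against, n_games, season, team, stats_df):
    """Ratio of two summed stats over a team's first n_games of a season.

    season and team must be scalars, not Series.
    """
    subset = stats_df[(stats_df.playerTeam == team) & (stats_df.season == season)]
    first = subset.nsmallest(n_games, "gameDate")
    return float(first[stat_for].sum()) / float(first[stat_against].sum())


# Option 1: row-wise apply; each call receives scalar season/team values.
df_sum_all["xg5"] = df_sum_all.apply(
    lambda row: calc_ratio("xGoalsFor", "xGoalsAgainst", 5,
                           row["season"], row["team"], df_all),
    axis=1,
)

# Option 2: groupby; take the first 5 games per (season, team), then sum.
keys = ["season", "playerTeam"]
first_n = df_all.sort_values("gameDate").groupby(keys).head(5)
sums = first_n.groupby(keys)[["xGoalsFor", "xGoalsAgainst"]].sum()
ratio = (sums["xGoalsFor"] / sums["xGoalsAgainst"]).rename("xg5_vec").reset_index()

# Attach the grouped ratio to the summary frame by matching season and team.
merged = df_sum_all.merge(ratio, left_on=["season", "team"],
                          right_on=keys, how="left")
print(merged[["season", "team", "xg5", "xg5_vec"]])
```

For NYR 2008 both options give about 0.826, which agrees with the xg5 value in the output above. The `apply` version is still a Python-level loop, so it is a tidier spelling of `iterrows` rather than a speed-up; the `groupby` version avoids re-filtering `df_all` for each row.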

1 answer
User
#1 · Posted 2024-06-08 18:06:59

"ValueError: Can only compare identically-labeled Series objects"

tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statagainst).sum())

The values passed in for the variables:

team: df_sum_all['team']
season: df_sum_all['season']
statsdf: df_all

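So in the code, (statsdf.playerTeam == team) compares a Series from df_all with a Series from df_sum_all. If the two Series do not have identical labels (the same index), you will see the error above.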
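A minimal sketch of that behaviour, using small made-up frames (`games` and `summary` are hypothetical stand-ins for df_all and df_sum_all): comparing a column against a scalar broadcasts to a boolean mask, while comparing it against a Series with a different index raises the ValueError.

```python
import pandas as pd

# Hypothetical toy stand-ins for df_all and df_sum_all.
games = pd.DataFrame({"playerTeam": ["NYR", "NYR", "L.A", "L.A"],
                      "season": [2008, 2008, 2008, 2009]})
summary = pd.DataFrame({"team": ["NYR", "L.A"], "season": [2008, 2008]})

# Scalar comparison: the value is broadcast against every row.
mask = games.playerTeam == "NYR"
print(mask.tolist())

# Series-vs-Series comparison: the indexes differ (4 labels vs 2),
# so pandas refuses to align them and raises ValueError.
try:
    games.playerTeam == summary["team"]
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

This is why the same function works when it is called once per row with scalar values and fails when whole columns are passed in.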
