Using groupby with as_index=False, count, and to_frame raises "'DataFrame' object has no attribute 'to_frame'"
I'm trying to create a new DataFrame from an existing one using groupby(), count(), and to_frame(). After adding as_index=False to the groupby, I get this error: AttributeError: 'DataFrame' object has no attribute 'to_frame'.
Here is my code:
newdat = indat.query('-1017 <= WDIR16 <= -1000')
newdat.reset_index(drop=True, inplace=True)
newdat.sort_values(by=['YEAR', 'MO', 'GP', 'HR'], inplace=True)
# Find Count
w1 = newdat.groupby(['YEAR','MO', 'GP','HR'], as_index=False)["WDIR16"].count().to_frame(name='wndclimodirectionobsqty').reset_index()
# Find Means
w1['wndclimomeanspeedrate'] = newdat.groupby(['YEAR','MO', 'GP', 'HR'], as_index=False).aggregate({'WSPD':'mean'}, as_index=False).values
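The error occurs on the to_frame line. I use as_index=False in the groupby because the original DataFrame is sometimes empty but still has columns (reference: keeping columns after a groupby on an empty DataFrame). Without as_index=False, the to_frame line works, but if the DataFrame is empty when the groupby runs, the empty columns are not carried over to the new DataFrame. Any ideas?
Here are a few rows of the new DataFrame newdat: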
NETWORKTYPE,PLATFORMID,REPORTTYPECODE,OBSERVATIONTIME,YEAR,MO,DAY,HR,MINUTEDV,PLATFORMHEIGHT,TEMPC,DEWPC,WDIR,WSPD,GUST,SLP,STNPRES,ALSTG,CIG,SKY,CAVOK,VSBY,PRCP1,PRCPTIM1,PRCP2,PRCPTIM2,PRCP3,PRCPTIM3,PRCP4,PRCPTIM4,HUMREL,VAPOR,ABSHUM,SPHUM,TVIRTK,DENSITY,DENALT,PRSALT,SKY100,TEMP_GE32,TEMP_LE0,TEMP_LEM17,TSTM,FOG,FOG3MILE,BLOWSNOW,BLOWSAND,FREZRAIN,HAIL,SNOW,FROZPRCP,SNOWICE,RAIN,ALLPRECP,SMOKHAZE,SANDSNOW,OBSTVISN,U,V,WDIRCOS,WDIRSIN,WDIR16,CALM,LIGHT,WSPDGT12_8,WSPDGT12_3,WSPDGT9_7,WSPDGT17_5,WSPDGT25_2,VSBY_800,VSBY_1600,VSBY_3200,VSBY_4800,GP
ICAO ,KOFF,SAO ,1948-01-12 06:00:00,1948,1,12,6,0,320.0,2.4,0.2,290.0,4.6,,,,,22000.0,8.0,N,11200,,,,,,,,,,6.196962,4.87,,,,,0,100.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.32259,-1.57329,0.3420201029058752,-0.9396926354975091,-1014,0,0,0,0,0,0,0,1,1,1,1,3
ICAO ,KOFF,SAO ,1948-01-12 07:00:00,1948,1,12,6,0,320.0,2.4,-2.6,290.0,5.1,,,,,22000.0,7.0,N,8000,,,,,,,,,,5.045877,3.97,,,,,0,87.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,4.79243,-1.7443,0.3420201029058752,-0.9396926354975091,-1014,0,0,0,0,0,0,0,1,1,1,1,3
ICAO ,KOFF,SAO ,1948-01-12 08:00:00,1948,1,12,9,0,320.0,0.8,-1.5,290.0,4.6,,,,,22000.0,7.0,N,11200,,,,,,,,,,5.473223,4.33,,,,,0,87.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.32259,-1.57329,0.3420201029058752,-0.9396926354975091,-1014,0,0,0,0,0,0,0,1,1,1,1,3
2 Answers
0
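Using as_index=False together with to_frame doesn't make sense: with as_index=False, count() already returns a DataFrame (the group keys become regular columns), and a DataFrame has no to_frame method. Choose one or the other.
Either drop as_index=False: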
(newdat.groupby(['YEAR','MO', 'GP','HR'])['WDIR16'].count()
.to_frame(name='wndclimodirectionobsqty').reset_index()
)
Or keep as_index=False and use rename after the groupby:
(newdat.groupby(['YEAR','MO', 'GP','HR'], as_index=False)['WDIR16'].count()
.rename(columns={'WDIR16': 'wndclimodirectionobsqty'})
)
Output:
YEAR MO GP HR wndclimodirectionobsqty
0 1948 1 3 6 2
1 1948 1 3 9 1
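To see why the two options are mutually exclusive, it can help to check what each call returns. The following is a minimal sketch on a small hand-built frame (the name `df` and its six columns are made up for illustration, loosely mirroring the three sample rows in the question):

```python
import pandas as pd

# Hypothetical miniature of the question's data: only the columns used here.
df = pd.DataFrame({
    "YEAR": [1948, 1948, 1948],
    "MO": [1, 1, 1],
    "GP": [3, 3, 3],
    "HR": [6, 6, 9],
    "WDIR16": [-1014, -1014, -1014],
    "WSPD": [4.6, 5.1, 4.6],
})
keys = ["YEAR", "MO", "GP", "HR"]

# Default as_index=True: the keys go into the index and count() gives a Series,
# so to_frame() is available.
as_series = df.groupby(keys)["WDIR16"].count()

# as_index=False: the keys stay as columns and count() gives a DataFrame,
# which has no to_frame() method.
as_frame = df.groupby(keys, as_index=False)["WDIR16"].count()

print(type(as_series).__name__, hasattr(as_series, "to_frame"))
print(type(as_frame).__name__, hasattr(as_frame, "to_frame"))
```

In the second case the result is already a frame, so only the column name is left to fix, which is what rename does.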
If you want to run several computations at once and give them custom names, consider groupby.agg with named aggregation:
(newdat.groupby(['YEAR','MO', 'GP','HR'], as_index=False)
.agg(**{'wndclimodirectionobsqty': ('WDIR16', 'count'),
'wndclimomeanspeedrate' : ('WSPD', 'mean')
})
)
Output:
YEAR MO GP HR wndclimodirectionobsqty wndclimomeanspeedrate
0 1948 1 3 6 2 4.85
1 1948 1 3 9 1 4.60
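Since the question is mainly about the empty case, here is a rough sketch for checking how the named aggregation behaves when the filtered frame has no rows. The frame `sample` and the helper `summarize` are hypothetical names introduced only for this example, and the exact columns of the empty result may depend on the pandas version, so the sketch prints them rather than assuming them:

```python
import pandas as pd

keys = ["YEAR", "MO", "GP", "HR"]

# Hypothetical miniature of the question's data: only the columns used here.
sample = pd.DataFrame({
    "YEAR": [1948, 1948, 1948],
    "MO": [1, 1, 1],
    "GP": [3, 3, 3],
    "HR": [6, 6, 9],
    "WDIR16": [-1014, -1014, -1014],
    "WSPD": [4.6, 5.1, 4.6],
})

def summarize(frame):
    """Count WDIR16 and average WSPD per group, keeping the keys as columns."""
    return frame.groupby(keys, as_index=False).agg(
        wndclimodirectionobsqty=("WDIR16", "count"),
        wndclimomeanspeedrate=("WSPD", "mean"),
    )

full = summarize(sample)

# iloc[0:0] keeps the columns and dtypes but drops every row,
# imitating a query that matched nothing.
empty = summarize(sample.iloc[0:0])

print(full)
print(list(empty.columns), len(empty))
```

Running this against your own pandas version shows directly whether the key columns and the two named columns survive on an empty input.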
0
If I understand correctly, you can do this:
import pandas as pd

indat = pd.read_csv("your_data.csv")
newdat = indat.query("-1017 <= WDIR16 <= -1000")
newdat.reset_index(drop=True, inplace=True)
newdat.sort_values(by=["YEAR", "MO", "GP", "HR"], inplace=True)
# Find Count (as_index=False already returns a DataFrame, so rename instead of to_frame)
w1 = (
    newdat.groupby(["YEAR", "MO", "GP", "HR"], as_index=False)["WDIR16"]
    .count()
    .rename(columns={"WDIR16": "wndclimodirectionobsqty"})
)
# Find Means (assigned by position, in the same group order as w1)
w1["wndclimomeanspeedrate"] = (
    newdat.groupby(["YEAR", "MO", "GP", "HR"])["WSPD"].agg("mean").values
)
print(w1)
Output:
YEAR MO GP HR wndclimodirectionobsqty wndclimomeanspeedrate
0 1948 1 3 6 2 4.85
1 1948 1 3 9 1 4.60
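One thing to keep in mind with the snippet above: assigning .values lines the means up with the counts purely by position, which works because both groupby calls produce the groups in the same order. As an alternative sketch (not the answerer's method), the two results can be joined on the key columns instead. The frame `df` and the names `counts` and `means` are hypothetical and only serve this example:

```python
import pandas as pd

keys = ["YEAR", "MO", "GP", "HR"]

# Hypothetical miniature of the question's data: only the columns used here.
df = pd.DataFrame({
    "YEAR": [1948, 1948, 1948],
    "MO": [1, 1, 1],
    "GP": [3, 3, 3],
    "HR": [6, 6, 9],
    "WDIR16": [-1014, -1014, -1014],
    "WSPD": [4.6, 5.1, 4.6],
})

# Per-group observation count, with the keys kept as regular columns.
counts = (
    df.groupby(keys, as_index=False)["WDIR16"]
    .count()
    .rename(columns={"WDIR16": "wndclimodirectionobsqty"})
)

# Per-group mean wind speed, computed the same way.
means = (
    df.groupby(keys, as_index=False)["WSPD"]
    .mean()
    .rename(columns={"WSPD": "wndclimomeanspeedrate"})
)

# Join on the keys so rows are matched by group rather than by position.
w1 = counts.merge(means, on=keys, how="left")
print(w1)
```

The merge costs a little more than a positional assignment, but it does not rely on the two results having identical row order.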