Trying to understand why this code does not return the same value

Posted 2024-05-16 19:53:10


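I'm new to Python and pandas, and I'm trying to understand the difference between two snippets and why they do different things.

I tried splitting the code into separate lines, but the two snippets still give different answers.

What proportion of female students major in Physics?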

Code 1:

fem_phy = (df.query("gender == 'female' & major == 'Physics'").count() /
           df.query("gender=='female'").count())
fem_phy

Code 2:

len(df[(df['gender'] == 'female') & (df['admitted']) & 
   (df['major']=='Physics')]) / len(df[(df['gender']=='female') & 
   (df['admitted'])])

I expected the second snippet to return 0.120623, like the first one.


1 answer
User
#1 · Posted 2024-05-16 19:53:10

Check with some sample data:

import pandas as pd

#sample data
df = pd.DataFrame({'gender':['female'] * 3 + ['male'] * 2,
                   'major':['Physics'] * 2 + ['Math'] * 3})

print (df)
   gender    major
0  female  Physics
1  female  Physics
2  female     Math
3    male     Math
4    male     Math

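To filter rows correctly, use DataFrame.query or boolean indexing. To get the same output from both, the df['admitted'] condition is removed from the second snippet: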

print (df.query("gender == 'female' & major == 'Physics'"))
   gender    major
0  female  Physics
1  female  Physics

print (df.query("gender=='female'"))
   gender    major
0  female  Physics
1  female  Physics
2  female     Math

print (df[(df['gender']=='female') & (df['major']=='Physics')])
   gender    major
0  female  Physics
1  female  Physics

print (df[(df['gender']=='female')])
   gender    major
0  female  Physics
1  female  Physics
2  female     Math
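
The df['admitted'] condition matters because it changes both the numerator and the denominator. A minimal sketch, assuming a hypothetical admitted column (the sample data above has none, so the frame and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical data with an 'admitted' column (not part of the sample above)
demo = pd.DataFrame({
    'gender':   ['female', 'female', 'female', 'male'],
    'major':    ['Physics', 'Physics', 'Math', 'Math'],
    'admitted': [True, False, True, True],
})

female = demo['gender'] == 'female'
physics = demo['major'] == 'Physics'
admitted = demo['admitted']

# Share of Physics majors among all female students
share_all = (female & physics).sum() / female.sum()

# Share of Physics majors among admitted female students only
share_admitted = (female & physics & admitted).sum() / (female & admitted).sum()

print(share_all)
print(share_admitted)
```

The two ratios answer different questions, so they should only be expected to match if the same conditions are used in both snippets.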

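The problem is DataFrame.count: it returns the number of non-missing values in each column, not a single row count. So here you get a Series in which every value is 2 (because the data has no missing values):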

print (df.query("gender == 'female' & major == 'Physics'").count())
gender    2
major     2
dtype: int64
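
The difference between count and len shows up as soon as a column has a missing value. A small sketch, using a hypothetical frame (demo and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in 'major'
demo = pd.DataFrame({'gender': ['female', 'female', 'male'],
                     'major': ['Physics', np.nan, 'Math']})

per_column = demo.count()  # Series: non-missing values per column
n_rows = len(demo)         # int: number of rows, missing values included

print(per_column)
print(n_rows)
```

Here count reports 3 for gender but only 2 for major, while len reports 3 rows.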

The correct approach is to get the number of rows with len:

print (len(df.query("gender == 'female' & major == 'Physics'")))
2

print (len(df[(df['gender']=='female') & (df['major']=='Physics')]))
2

Or simply count the True values of the mask with sum:

print ((df['gender']=='female') & (df['major']=='Physics'))
0     True
1     True
2    False
3    False
4    False
dtype: bool

print (((df['gender']=='female') & (df['major']=='Physics')).sum())
2

Putting it all together:

mask1 = (df['gender']=='female')
mask2 = (df['major']=='Physics')
print ((mask1 & mask2).sum() / mask1.sum())
0.6666666666666666

df1 = df.query("gender == 'female' & major == 'Physics'")
df2 = df.query("gender=='female'")
print (len(df1) / len(df2))
0.6666666666666666

df1 = df[(df['gender']=='female') & (df['major']=='Physics')]
df2 = df[(df['gender']=='female')]
print (len(df1) / len(df2))
0.6666666666666666
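
For comparison, dividing two count() results, as in the first snippet of the question, gives one ratio per column rather than a single number. The sketch below shows that on the same sample data. It also shows one possible shortcut that is not part of the approaches above: the mean of a boolean Series is the fraction of True values.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'gender': ['female'] * 3 + ['male'] * 2,
                   'major': ['Physics'] * 2 + ['Math'] * 3})

# Series / Series: one ratio per column, not a scalar
ratio_per_column = (df.query("gender == 'female' & major == 'Physics'").count()
                    / df.query("gender == 'female'").count())
print(ratio_per_column)

# Alternative: select the majors of female students, then take the
# mean of the boolean comparison to get the fraction of Physics majors
share = (df.loc[df['gender'] == 'female', 'major'] == 'Physics').mean()
print(share)
```

Because count skips missing values column by column, the per-column ratios can differ from each other when the data has gaps, which is another reason to prefer len or sum here.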
