How to update columns using another column in pandas

Posted 2024-05-23 13:44:13


I am trying to build a DataFrame that records how many public schools were open in each year from 2010 to 2016.

   StatusType  County   2010  ...  2016  OpenYear  ClosedYear
1  Closed      Alameda  0     ...  0     2005      2015.0
2  Active      Alameda  0     ...  0     2006      NaN
3  Closed      Alameda  0     ...  0     2008      2015.0
4  Active      Alameda  0     ...  0     2011      NaN
5  Active      Alameda  0     ...  0     2011      NaN
6  Active      Alameda  0     ...  0     2012      NaN
7  Closed      Alameda  0     ...  0     1980      1989.0
8  Active      Alameda  0     ...  0     1980      NaN
9  Active      Alameda  0     ...  0     1980      NaN

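I would like to update the 2010-2016 columns so that they record the number of schools open in each year. For example, the first school in the data frame opened in 2005 and closed in 2015. The iterator should check the "ClosedYear" column and add 1 to that row's value in every column < 2015 (2010, 2011, …, 2014). For a school with no closing year, 1 should be added to every column from 2010 onward, once the school has opened.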

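I am considering using "apply" to apply a function to the DataFrame, but that may not be the most efficient way to solve the problem. I need help figuring out how to make this work. Thanks!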

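Extra step: once the counting is done, I would like to group the year columns by county. I am leaning toward "groupby" with a sum function to total the number of open schools per county per year. It would be very helpful if someone could include this in an answer to the question above.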

Expected output:

^{pr2}$

2 answers

I feel there should be a way to do this without a for loop, but I can't think of one at the moment, so here is my solution:

# Read example data
import pandas as pd
from io import StringIO  # io.StringIO is the Python 3 location
df = pd.read_fwf(StringIO(
"""StatusType  County    OpenYear    ClosedYear
Closed      Alameda   2005        2015.0
Active      Alameda   2006         NaN
Closed      Alameda   2008        2015.0
Active      Alameda   2011         NaN
Active      Alameda   2011         NaN
Active      Alameda   2012         NaN
Closed      Alameda   1980        1989.0
Active      Alameda   1980         NaN
Active      Alameda   1980         NaN"""))

# For each year
for year in range(2010, 2016+1):
    # Create a column of 0s
    df[str(year)] = 0
    # Set it to 1 where the year is between OpenYear and ClosedYear
    # (a NaN ClosedYear means the school is still open)
    is_open = (df['OpenYear'] <= year) & (df['ClosedYear'].isna() | (df['ClosedYear'] >= year))
    df.loc[is_open, str(year)] = 1

print(df.to_string())

Output:

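  StatusType   County  OpenYear  ClosedYear  2010  2011  2012  2013  2014  2015  2016
0     Closed  Alameda      2005      2015.0     1     1     1     1     1     1     0
1     Active  Alameda      2006         NaN     1     1     1     1     1     1     1
2     Closed  Alameda      2008      2015.0     1     1     1     1     1     1     0
3     Active  Alameda      2011         NaN     0     1     1     1     1     1     1
4     Active  Alameda      2011         NaN     0     1     1     1     1     1     1
5     Active  Alameda      2012         NaN     0     0     1     1     1     1     1
6     Closed  Alameda      1980      1989.0     0     0     0     0     0     0     0
7     Active  Alameda      1980         NaN     1     1     1     1     1     1     1
8     Active  Alameda      1980         NaN     1     1     1     1     1     1     1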
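A loop-free variant seems possible with NumPy broadcasting. The sketch below is not part of the original answer. It assumes the same OpenYear and ClosedYear columns, treats a missing ClosedYear as "still open", and counts a school as open in its closing year, like the loop above. The small sample frame and the names `years`, `opened`, `closed`, `is_open`, and `flags` are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative sample with the same columns as the question's data
df = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Alameda", "Alameda"],
    "OpenYear": [2005, 2006, 2011, 1980],
    "ClosedYear": [2015.0, np.nan, np.nan, 1989.0],
})

years = np.arange(2010, 2017)  # 2010 .. 2016

# Column vectors of shape (n_rows, 1) broadcast against years of shape (n_years,)
opened = df["OpenYear"].to_numpy()[:, None]
# Assumption: a missing ClosedYear means the school never closed
closed = df["ClosedYear"].fillna(np.inf).to_numpy()[:, None]

# True where the school is open in that year (closing year included)
is_open = (opened <= years) & (closed >= years)

# One 0/1 column per year, named by the year as a string
flags = pd.DataFrame(is_open.astype(int), index=df.index,
                     columns=[str(y) for y in years])
result = pd.concat([df, flags], axis=1)
print(result.to_string())
```

The comparison produces an n_rows by n_years boolean matrix in one step, so no per-year column assignment is needed.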

(Note: I'm not quite sure what you want to do with groupby.)
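If the aim of the groupby step is a per-county total for each year, one possible reading is to sum the 0/1 year columns within each county. This is a minimal sketch under that assumption, with the year columns named by strings as in the loop above. The two-county frame is made up for illustration (the question's sample only contains Alameda), and `year_cols` and `per_county` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame whose 0/1 year columns have already been filled in
df = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Other", "Other"],
    "2010": [1, 0, 1, 1],
    "2011": [1, 1, 0, 1],
})

year_cols = ["2010", "2011"]

# Sum the indicators within each county: open schools per county per year
per_county = df.groupby("County")[year_cols].sum()
print(per_county)
```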

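Unless you really need to create those intermediate columns, you can get the counts directly with groupby.size. Depending on whether you want to count a school as open in its closing year, keep the comparison df.ClosedYear >= year or change it to df.ClosedYear > year. If you want to group by county, you can do that in the same step.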

This is the starting df:

  StatusType   County  OpenYear  ClosedYear
1     Closed  Alameda      2005      2015.0
2     Active  Alameda      2006         NaN
3     Closed  Alameda      2008      2015.0
4     Active  Alameda      2011         NaN
5     Active  Alameda      2011         NaN
6     Active  Alameda      2012         NaN
7     Closed  Alameda      1980      1989.0
8     Active  Alameda      1980         NaN
9     Active  Alameda      1980         NaN

import pandas as pd
year_list = [2010, 2011, 2012, 2013, 2014, 2015, 2016]
df_list = []

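for year in year_list:
    # True for schools open in `year`: opened on or before it and not closed before it
    group = ((df.ClosedYear.isnull()) | (df.ClosedYear >= year)) & (df.OpenYear <= year)
    # Count rows per (open?, County) pair and keep only the open ones
    n_schools = df.groupby([group, df.County]).size()[True]
    df_list.append(pd.DataFrame({'n_schools': n_schools, 'year': year}))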

ndf = pd.concat(df_list)
#         n_schools  year
#County                  
#Alameda          5  2010
#Alameda          7  2011
#Alameda          8  2012
#Alameda          8  2013
#Alameda          8  2014
#Alameda          8  2015
#Alameda          6  2016
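The per-year loop can also be folded into a helper that returns a county-by-year table. This is a sketch rather than the answer's own code: the function name `open_counts` and its `include_closing_year` flag are hypothetical, the sample frame is a shortened illustration, and the helper filters with the boolean mask before grouping instead of indexing the grouped result with True.

```python
import numpy as np
import pandas as pd

def open_counts(df, years, include_closing_year=True):
    """Count open schools per county for each year (hypothetical helper)."""
    counts = {}
    for year in years:
        still_open = df["ClosedYear"].isna()
        if include_closing_year:
            not_closed = still_open | (df["ClosedYear"] >= year)
        else:
            not_closed = still_open | (df["ClosedYear"] > year)
        mask = not_closed & (df["OpenYear"] <= year)
        # Filter first, then count the remaining rows per county
        counts[year] = df[mask].groupby("County").size()
    # Rows are counties, columns are years; missing combinations become 0
    return pd.DataFrame(counts).fillna(0).astype(int)

# Shortened illustrative sample
df = pd.DataFrame({
    "County": ["Alameda"] * 3,
    "OpenYear": [2005, 2011, 1980],
    "ClosedYear": [2015.0, np.nan, 1989.0],
})

table = open_counts(df, [2010, 2015, 2016])
strict = open_counts(df, [2015], include_closing_year=False)
print(table)
print(strict)
```

With the flag set to False, a school that closed in 2015 is no longer counted for 2015, which is the effect of switching the comparison from >= to >.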
