Select a subset of a DataFrame with N years of values for each variable

import pandas as pd

data = {'Country': ['Israel', 'Congo', 'Denmark',
                    'Israel', 'Denmark',
                    'Israel', 'Congo',
                    'Israel', 'Congo', 'Denmark'],
        'Year': [2000, 2000, 2000,
                 2001, 2001,
                 2002, 2002,
                 2003, 2003, 2003],
        'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1]}
df = pd.DataFrame(data=data)
df

   Country  Year  Value
0   Israel  2000    2.5
1    Congo  2000    1.2
2  Denmark  2000    3.1
3   Israel  2001    2.8
4  Denmark  2001    1.1
5   Israel  2002    2.9
6    Congo  2002    3.1
7   Israel  2003    1.9
8    Congo  2003    3.0
9  Denmark  2003    3.1

gp = df.groupby('Country').groups  # Group by country name.
d = {}  # Build a dictionary: country name => list of row indices.

# Iterate over all countries until a list of 3 indices is reached for each one.
for i in gp:
    d[i] = []
    for j in gp[i]:
        # A country appears once per year in the dataset, so 3 means 3 years.
        # If a country appears more than 3 times, only the indices of the
        # first 3 occurrences are included.
        if len(d[i]) < 3:
            d[i].append(j)

indeces = []  # Gather the indices to keep in the DataFrame.
for i in d:
    for j in d[i]:
        if len(d[i]) == 3:  # Make sure the list has exactly 3 items.
            indeces.append(j)

final_df = df.loc[indeces, ['Country', 'Year', 'Value']]
final_df
# Now I have one less value for Israel, so all countries have 3 values.

   Country  Year  Value
1    Congo  2000    1.2
6    Congo  2002    3.1
8    Congo  2003    3.0
2  Denmark  2000    3.1
4  Denmark  2001    1.1
9  Denmark  2003    3.1
0   Israel  2000    2.5
3   Israel  2001    2.8
5   Israel  2002    2.9

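2 Answers

User

Answer 1 · edited 2024-04-25 22:34:06

Here is my solution using pandas. It does what it needs to do, even though it takes a lot of lines of code. Thanks to @Vaishali for the help: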

threshold = 3  # Countries that occur fewer times than this are removed;
               # if a country occurs more often, the extra occurrences
               # with the lowest values are removed.
newIndex = df.set_index('Country')  # Set a new index to make selection by
                                    # index possible.
values = newIndex.index.value_counts()  # Count occurrences of index values.
# Keep index values that occur >= threshold times.
to_keep = values[values >= threshold].index.values
# Select the rows and columns to keep.
rank_df = newIndex.loc[to_keep, ['Value', 'Year']]

# Sort values in descending order before applying the threshold.
rank_df = rank_df.sort_values('Value', ascending=False)
# Group again; since values are sorted, head() keeps the highest values.
rank_df = rank_df.groupby(rank_df.index).head(threshold)
rank_df = rank_df.groupby([rank_df.index, 'Year']).mean() \
              .sort_values('Value', ascending=False)

# Finally, reset the index to turn the Year level back into a column,
# and sort by year.
rank_df.reset_index(level=1).sort_values('Year')

Output:

            Year    Value
Country         
Denmark     2000    3.1
Israel      2000    2.5
Congo       2000    1.2
Israel      2001    2.8
Denmark     2001    1.1
Congo       2002    3.1
Israel      2002    2.9
Denmark     2003    3.1
Congo       2003    3.0
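
For comparison, the same steps (drop countries with fewer than `threshold` rows, then keep the `threshold` highest values per country) can probably be written more compactly. This is only a sketch under the assumption that the sample data from the question is used; the names `counts`, `eligible` and `top` are made up for illustration and are not part of the answer above.

```python
import pandas as pd

data = {'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                    'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
        'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
        'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1]}
df = pd.DataFrame(data)

threshold = 3

# Number of rows per country, broadcast back onto every row.
counts = df.groupby('Country')['Value'].transform('size')

# Keep only countries that occur at least `threshold` times.
eligible = df[counts >= threshold]

# Sort by Value (highest first), keep the first `threshold` rows per
# country, then restore a readable Year/Country order.
top = (eligible.sort_values('Value', ascending=False)
               .groupby('Country')
               .head(threshold)
               .sort_values(['Year', 'Country'])
               .reset_index(drop=True))
print(top)
```

Unlike the version above, this keeps `Country` as a regular column instead of moving it into the index.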

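User

Answer 2 · edited 2024-04-25 22:34:06

You can build a list of the most recent years from the unique values in the Year column, then use boolean indexing to filter the DataFrame with that list.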

recent_years = df.Year.unique()[-3:]
df[df.Year.isin(recent_years)]

    Country Year    Value
3   Israel  2001    2.8
4   Denmark 2001    1.1
5   Israel  2002    2.9
6   Congo   2002    3.1
7   Israel  2003    1.9
8   Congo   2003    3.0
9   Denmark 2003    3.1

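If your year values are not necessarily in order, use NumPy's unique, which returns a sorted array, instead of pandas unique(), which returns values in order of first appearance.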

import numpy as np

recent_years = np.unique(df.Year)[-3:]
df[df.Year.isin(recent_years)]
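
To make the difference concrete, here is a small sketch with a deliberately unsorted series of years. The series `years` and its values are invented for illustration and are not taken from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately unsorted year values.
years = pd.Series([2003, 2000, 2002, 2001, 2003, 2000])

# pandas unique() keeps the order of first appearance.
appearance_order = years.unique()

# numpy.unique() returns the distinct values sorted ascending.
sorted_order = np.unique(years)

print(appearance_order[-3:])  # last three in appearance order
print(sorted_order[-3:])      # the three largest years
```

With unsorted input, only the `np.unique` slice is guaranteed to contain the three most recent years.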

Here is another solution that returns the 3 most recent years for each country. If the data is not sorted by year, it needs to be sorted first.

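# Index (Country, original row label) of the last 3 rows per country.
idx = df.groupby('Country').apply(lambda x: x['Year'].tail(3)).index
# drop(columns=...) replaces the positional axis argument, which pandas 2.0 removed.
df.set_index(['Country', df.index]).reindex(idx).reset_index().drop(columns='level_1')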

    Country Year    Value
0   Congo   2000    1.2
1   Congo   2002    3.1
2   Congo   2003    3.0
3   Denmark 2000    3.1
4   Denmark 2001    1.1
5   Denmark 2003    3.1
6   Israel  2001    2.8
7   Israel  2002    2.9
8   Israel  2003    1.9

If the data is not sorted, first use:

df = df.sort_values(by = 'Year')
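
As a possible shortcut, sorting and `groupby(...).tail(3)` can be chained directly, which avoids the `apply`/`reindex` round trip. This is a sketch of an alternative, not the method used above; it assumes the question's sample data, and the name `last3` is made up for illustration.

```python
import pandas as pd

data = {'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                    'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
        'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
        'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1]}
df = pd.DataFrame(data)

# Sort by year so that tail(3) picks the three most recent rows of each
# country, then order the result by country and year for readability.
last3 = (df.sort_values('Year')
           .groupby('Country')
           .tail(3)
           .sort_values(['Country', 'Year'])
           .reset_index(drop=True))
print(last3)
```

Countries with fewer than 3 rows would keep all of their rows here, so a separate count filter is still needed if those should be dropped.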
