Dynamically adding columns for data processing in a Python pandas DataFrame
I've run into a problem.
Suppose this is my CSV file:
id f1 f2 f3
1 4 5 5
1 3 1 0
1 7 4 4
1 4 3 1
1 1 4 6
2 2 6 0
..........
My rows can be grouped by id. I would like to produce a CSV like the following as output.
f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4 5 5 3 1 0 7 4 4 1 4 6
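I want to be able to choose how many rows to grab and turn into columns (always starting from the first row of an id). In this example I grab 3 rows. I then skip one or more rows (just one in this example) and take the final columns from the last row of the same id group. For my own reasons, I want to use a DataFrame.
After struggling for 3 to 4 hours, I came up with the solution shown below, but it is very slow. I have about 700,000 rows and roughly 70,000 id groups. With model=3, the code below took almost an hour on my 4 GB, 4-core Lenovo, and I need to go up to model=10 or 15. I am still a novice in Python, and I am sure a few changes could make this much faster. Can someone explain in detail how I can improve the code?
Thanks a lot.
model: the number of rows to grab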
import pandas as pd

# train data frame from reading the csv
train = pd.read_csv(filename)
# Get groups of rows with same id
csv_by_id = train.groupby('id')
# Lists rather than sets, so that the column order is well defined
modelTarget = ['f1_t', 'f2_t', 'f3_t']
# modelFeatures is a list of features I am interested in the csv.
# The csv actually has hundreds
modelFeatures = ['f1', 'f2', 'f3']
coreFeatures = list(modelFeatures) # cloning
selectedFeatures = list(modelFeatures) # cloning
newFeatures = list(selectedFeatures) # cloning
finalFeatures = list(selectedFeatures) # cloning
# Now create the column list depending on the number of rows I will grab from
for x in range(2, model + 1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures
# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget)
# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)
for id_group in csv_by_id:
    # id_group is a tuple with first element as the id itself and second one a dataframe with the rows of a group
    group_data = id_group[1]
    # hmm - can this be better? I am picking up the rows which I need from the first row onwards
    df = group_data[coreFeatures][0:model]
    # initialize a list
    tmp = []
    # now keep adding the column values into the list
    for index, row in df.iterrows():
        tmp = tmp + list(row)
    # Wow, this one below surely should have something better.
    # So I am picking up the feature column values from the last row of the group of rows for a particular id
    targetValues = group_data[['f1', 'f2', 'f3']][len(group_data.index)-1:len(group_data.index)].values
    # Think this can be done easier too? Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten())
    # converting the list to a dict.
    tmpDict = dict(zip(selectedFeatures, tmp))
    # then the dict to a dataframe.
    tmpDf = pd.DataFrame(tmpDict, index=[1])
    # I just could not find a better way of adding a dict or list directly into a dataframe.
    # And I went through lots and lots of blogs on this topic, including some in StackOverflow.
    # finally I add the frame to my main frame
    # (DataFrame.append was removed in pandas 2.0; pd.concat([model_data, tmpDf]) is the replacement)
    model_data = model_data.append(tmpDf)
# and write it
model_data.to_csv(wd + 'model_data' + str(model) + '.csv', index=False)
1 Answer
Groupby is your friend.
This scales very well; the number of features adds only a small constant factor. Roughly, the cost is O(number of groups).
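# Names used below assume: import numpy as np; from pandas import DataFrame, Series, concat
In [28]: features = ['f1','f2','f3']
First create some test data: 70,000 groups of 7 to 11 rows each (np.random.randint excludes its upper bound).
In [29]: def create_df(i):
   ....:     l = np.random.randint(7,12)
   ....:     df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
   ....:     df['A'] = i
   ....:     return df
   ....:
In [30]: df = concat([ create_df(i) for i in range(70000) ])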
In [39]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629885 entries, 0 to 9
Data columns (total 4 columns):
f1 629885 non-null int64
f2 629885 non-null int64
f3 629885 non-null int64
A 629885 non-null int64
dtypes: int64(4)
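Create a frame with the first 3 rows and the final row of each group (note that this does handle groups with fewer than 4 rows, but in that case the final row will duplicate one of the first three; you may wish to use groupby.filter to exclude such groups).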
In [31]: groups = concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
# This step is necessary in pandas < master/0.14, as the returned fields
# will include the grouping field (the A); this is a bug/API issue
In [33]: groups = groups[features]
In [34]: groups.head(20)
Out[34]:
     f1  f2  f3
A
0 0   0   0   0
  1   1   1   1
  2   2   2   2
  7   7   7   7
1 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9
2 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
3 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
4 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9
[20 rows x 3 columns]
In [38]: groups.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 280000 entries, (0, 0) to (69999, 9)
Data columns (total 3 columns):
f1 280000 non-null int64
f2 280000 non-null int64
f3 280000 non-null int64
dtypes: int64(3)
And it is pretty fast.
In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
1 loops, best of 3: 1.16 s per loop
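The groupby.filter suggestion above can be sketched on a toy frame. This is only an illustration under assumptions: it uses a recent pandas (where head and tail keep the original row labels and the grouping column), and the toy data and the name `big` are made up for the example.

```python
import pandas as pd

features = ["f1", "f2", "f3"]

# Toy data: group 0 has five rows, group 1 has only two.
df = pd.DataFrame({
    "A":  [0, 0, 0, 0, 0, 1, 1],
    "f1": [0, 1, 2, 3, 4, 0, 1],
    "f2": [0, 1, 2, 3, 4, 0, 1],
    "f3": [0, 1, 2, 3, 4, 0, 1],
})

# Keep only groups with at least 4 rows, so tail(1) never repeats a head(3) row.
big = df.groupby("A").filter(lambda g: len(g) >= 4)

grouped = big.groupby("A")
# head/tail keep the original row labels, so sort_index restores the input order.
groups = pd.concat([grouped.head(3), grouped.tail(1)]).sort_index()
print(groups)
```

Group 1 is dropped entirely, and group 0 contributes its rows 0, 1, 2 and 4. For many thousands of groups a Python-level lambda in filter may itself become noticeable, so it is worth timing on real data.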
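If you want to do further manipulation, you should usually stop here and work with the groups frame (it is already in a nicely grouped format that is easy to deal with).
If you want to convert it to a wide format: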
In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
1 loops, best of 3: 14.5 s per loop
In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]
In [41]: dfg.head()
Out[41]:
   f1_1  f2_1  f3_1  f1_2  f2_2  f3_2  f1_3  f2_3  f3_3  f1_4  f2_4  f3_4
A
0     0     0     0     1     1     1     2     2     2     7     7     7
1     0     0     0     1     1     1     2     2     2     9     9     9
2     0     0     0     1     1     1     2     2     2     8     8     8
3     0     0     0     1     1     1     2     2     2     8     8     8
4     0     0     0     1     1     1     2     2     2     9     9     9
[5 rows x 12 columns]
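The columns here are named f1_1 through f3_4, whereas the question asked for f1, f1_n, f1_n_n and a final f1_t. If that header is wanted, a small helper along these lines could generate it; `question_style_columns` is a hypothetical name, not something from pandas.

```python
def question_style_columns(features, model):
    """Build the header used in the question for `model` leading rows.

    Row k (0-based) of a group gets the suffix '_n' repeated k times,
    and the last row of the group gets the suffix '_t'.
    """
    columns = []
    for k in range(model):
        columns += [f + "_n" * k for f in features]
    columns += [f + "_t" for f in features]
    return columns


cols = question_style_columns(["f1", "f2", "f3"], 3)
print(cols)
```

With the frame above, this would be applied as `dfg.columns = question_style_columns(features, 3)`.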
In [42]: dfg.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70000 entries, 0 to 69999
Data columns (total 12 columns):
f1_1 70000 non-null int64
f2_1 70000 non-null int64
f3_1 70000 non-null int64
f1_2 70000 non-null int64
f2_2 70000 non-null int64
f3_2 70000 non-null int64
f1_3 70000 non-null int64
f2_3 70000 non-null int64
f3_3 70000 non-null int64
f1_4 70000 non-null int64
f2_4 70000 non-null int64
f3_4 70000 non-null int64
dtypes: int64(12)
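The groupby-apply step is the slow part of the wide conversion (14.5 s above). When every group contributes exactly the same number of rows, one alternative is to reshape the underlying NumPy array directly. The sketch below is not the method used in this answer; `to_wide` and its parameters are hypothetical names, and it assumes a recent pandas, Python 3.6+, and at least one group with more than `model` rows.

```python
import pandas as pd


def to_wide(df, key, features, model):
    """First `model` rows plus the last row of each group, one output row per group."""
    # Drop groups with `model` rows or fewer, so the last row is always distinct.
    sizes = df.groupby(key)[key].transform("size")
    df = df[sizes > model].reset_index(drop=True)

    grouped = df.groupby(key, sort=False)
    picked = pd.concat([grouped.head(model), grouped.tail(1)]).sort_index()
    # A stable sort keeps the within-group order even if a group's rows are not adjacent.
    picked = picked.sort_values(key, kind="mergesort")

    # Every group now has exactly model + 1 rows, so a plain reshape is enough.
    n_groups = picked[key].nunique()
    values = picked[features].to_numpy().reshape(n_groups, -1)
    columns = [f"{f}_{i}" for i in range(1, model + 2) for f in features]
    index = pd.Index(picked[key].to_numpy()[:: model + 1], name=key)
    return pd.DataFrame(values, index=index, columns=columns)


# The sample rows from the question; id 2 has a single row and is dropped.
train = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2],
    "f1": [4, 3, 7, 4, 1, 2],
    "f2": [5, 1, 4, 3, 4, 6],
    "f3": [5, 0, 4, 1, 6, 0],
})
wide = to_wide(train, "id", ["f1", "f2", "f3"], model=3)
print(wide)
```

For id 1 this yields the row 4 5 5 3 1 0 7 4 4 1 4 6, matching the desired output in the question. Since `model` is a parameter, the same call should work with model=10 or 15, provided groups are long enough.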