Dynamically adding columns to a Python pandas DataFrame for data processing

Question

I have run into a problem.

Suppose this is my CSV file:

id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
2  2  6  0
..........

My rows can be grouped by id. I want to produce a CSV like the following as output.

f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4  5  5   3   1    0    7      4      4      1   4     6

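I want to be able to choose how many rows to grab and turn into columns (always starting from the first row of an id). In this example I grab 3 rows. I then skip one or more rows (just one in this example) and take the final columns from the last row of the same id group. For my own reasons, I want to use a DataFrame.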
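To make the requested transformation concrete, here is a minimal sketch for a single id group. The helper name `flatten_group` and the hard-coded sample frame are illustrative assumptions, not part of the original code; the sample values are the five id=1 rows from the CSV above.

```python
import pandas as pd

def flatten_group(group, features, model):
    """Flatten the first `model` rows of `features`, then append the last row.

    `flatten_group` is a hypothetical helper used only for illustration.
    """
    # First `model` rows, read row by row into one flat list.
    head = group[features].iloc[:model].to_numpy().ravel().tolist()
    # Feature values of the last row of the group (the "_t" columns).
    last = group[features].iloc[-1].tolist()
    return head + last

# The five rows of id 1 from the example CSV.
sample = pd.DataFrame({
    "id": [1, 1, 1, 1, 1],
    "f1": [4, 3, 7, 4, 1],
    "f2": [5, 1, 4, 3, 4],
    "f3": [5, 0, 4, 1, 6],
})

row = flatten_group(sample, ["f1", "f2", "f3"], model=3)
print(row)  # [4, 5, 5, 3, 1, 0, 7, 4, 4, 1, 4, 6]
```

With model=3 the fourth row (4 3 1) is the skipped one, and the result matches the single output row shown above.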

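After struggling for three or four hours I found the solution shown below, but it is very slow. I have about 700,000 rows and roughly 70,000 id groups. With model=3, the code takes almost an hour on my 4-core Lenovo machine with 4 GB of RAM, and I need to set model to 10 or 15. I am still new to Python, and I am sure some changes could make this faster. Could someone explain in detail how I can improve the code?

Many thanks.

model: the number of rows to grab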

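import pandas as pd

# filename, wd (the output directory) and model are assumed to be defined earlier

# train data frame from reading the csv
train = pd.read_csv(filename)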

# Get groups of rows with same id
csv_by_id = train.groupby('id')

# Lists rather than sets, so that the column order is deterministic
modelTarget = ['f1_t', 'f2_t', 'f3_t']

# modelFeatures is a list of the features I am interested in from the csv.
# The csv actually has hundreds
modelFeatures = ['f1', 'f2', 'f3']

coreFeatures = list(modelFeatures) # cloning 


selectedFeatures = list(modelFeatures) # cloning

newFeatures = list(selectedFeatures) # cloning

finalFeatures = list(selectedFeatures) # cloning

# Now create the column list depending on the number of rows I will grab from
for x in range(2,model+1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures

# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget) 

# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)

for id_group in csv_by_id:
    #id_group is a tuple with first element as the id itself and second one a dataframe with the rows of a group
    group_data = id_group[1] 

    # Hmm, can this be better? I am picking up the rows I need, from the first row onwards
    df = group_data[coreFeatures][0:model] 

    # initialize a list
    tmp = [] 

    # now keep adding the column values into the list
    for index, row in df.iterrows(): 
        tmp = tmp + list(row)


    # Wow, this one below surely should have something better.
    # So I am picking up the feature column values from the last row of the group of rows for a particular id
    # (coreFeatures instead of a set literal, so the values stay in f1, f2, f3 order)
    targetValues = group_data[coreFeatures][len(group_data.index)-1:len(group_data.index)].values

    # I think this can be done more easily too. Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten())

    # converting the list to a dict.
    tmpDict = dict(zip(selectedFeatures, tmp))

    # then the dict to a one-row dataframe (a list, not a set, as the index).
    tmpDf = pd.DataFrame(tmpDict, index=[1])

    # I just could not find a better way of adding a dict or list directly into a dataframe. 
    # And I went through lots and lots of blogs on this topic, including some in StackOverflow.

    # finally I add the frame to my main frame
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call)
    model_data = pd.concat([model_data, tmpDf])

# and write it
model_data.to_csv(wd+'model_data' + str(model) + '.csv',index=False)
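
One possible direction for speeding this up, sketched under assumptions: instead of building a one-row frame per group and growing model_data inside the loop, number the rows within each id with `groupby(...).cumcount()`, keep the first `model` of them, and reshape once with `unstack`. The function name `build_model_data` and the temporary `_pos` column are hypothetical, and this has not been timed on the 700,000-row file. Groups with fewer than `model` rows get NaN in the missing columns, which differs from what the loop above does.

```python
import pandas as pd

def build_model_data(train, features, model):
    """Sketch: one output row per id, built without a Python-level loop over groups."""
    # Position of every row inside its id group: 0, 1, 2, ...
    work = train.assign(_pos=train.groupby("id").cumcount())
    head = work[work["_pos"] < model]

    # One row per id, one column per (feature, position) pair.
    wide = head.set_index(["id", "_pos"])[features].unstack("_pos")

    # Reorder to position-major order: f1, f2, f3, f1_n, f2_n, f3_n, ...
    order = [(f, p) for p in range(model) for f in features]
    wide = wide.reindex(columns=pd.MultiIndex.from_tuples(order))
    wide.columns = [f + "_n" * p for f, p in order]

    # Last row of each id supplies the "_t" target columns.
    last = train.groupby("id").tail(1).set_index("id")[features].add_suffix("_t")

    return wide.join(last).reset_index(drop=True)

# The six sample rows from the question.
train = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2],
    "f1": [4, 3, 7, 4, 1, 2],
    "f2": [5, 1, 4, 3, 4, 6],
    "f3": [5, 0, 4, 1, 6, 0],
})

result = build_model_data(train, ["f1", "f2", "f3"], model=3)
print(result.columns.tolist())
print(result.iloc[0].tolist())
```

The column names follow the same `_n` suffix scheme as the loop that builds finalFeatures, so the result can be written out with the same `to_csv(..., index=False)` call.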
