Create distinct tuple values for columns in a CSV and compute the average of the third column

Published 2024-03-29 11:11:51


I have a dataset:

string1 string2 rate distance 
A.      C.      1    20
A.      B       2.   30
A.      C.      2.   20

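The (string1, string2) pairs repeat across rows. I want to find the distinct (string1, string2) tuples and then compute the average of rate/distance for each of them. This is only dummy data; in the original data a given tuple occurs many times (around 10,000).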

So far I have created the tuples. I don't know how to merge the tuples and compute the average.

import csv

def read_csv(filepath, has_header=False):
    with open(filepath, 'r', newline='') as file:
        reader = csv.reader(file)

        data = list(reader)
        header = None
        if has_header:
            header = data[0]
            data = data[1:]

    # the with block closes the file, so no explicit close() is needed
    return data, header

if __name__ == '__main__':

    outfilepath = "data/outfile12.csv"

    outdata = []

    codes, header = read_csv("data/sample.csv", has_header=True)

    # create dictionary keyed by the (string1, string2) tuple
    codes_dict = {}
    for code in codes:
        codes_dict[(code[0], code[1])] = []

    for row in codes:
        # Write logic here
        pass

The output should look like this:

string1 string2 column 
    A      C      0.003    
    A      B     0.00030
    B      A    0.000020

Can anyone help me?


2 Answers

You should consider using pandas for this kind of task. Look up the docs yourself for your specific case (a CSV file with no header row); I'll give a basic example:

import pandas as pd

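First load the CSV. How to do that depends on its format, so you may need to change the separator; I took the format from the sample data (columns separated by multiple spaces):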

dataframe = pd.read_csv(filepath, sep=r'\s+')
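To see what the whitespace separator does, here is a minimal sketch that feeds the question's sample rows through `pd.read_csv` from an in-memory buffer. The `raw` buffer is a stand-in for the real file, and the stray trailing dots from the sample are kept as-is.

```python
from io import StringIO

import pandas as pd

# Sample rows from the question, with columns aligned by runs of spaces.
# This in-memory buffer is an assumption standing in for the real file.
raw = StringIO(
    "string1 string2 rate distance\n"
    "A.      C.      1    20\n"
    "A.      B       2.   30\n"
    "A.      C.      2.   20\n"
)

# r'\s+' treats any run of whitespace as a single separator.
dataframe = pd.read_csv(raw, sep=r'\s+')
print(dataframe)
```

Values such as `2.` are read as floats, so the `rate` column ends up numeric even though the sample mixes `1` and `2.`.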

Then group the data by that set of columns:

groupby = dataframe.groupby(['string1','string2'])
print(groupby.groups) 

This returns a "DataFrameGroupBy" object, which is essentially a wrapper around a list of pairs: (tuple of column values, dataframe of the rows matching those values).
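As a rough illustration of that structure, the sketch below iterates over the grouped object. The small `dataframe` built here is a hypothetical stand-in for the one loaded from the CSV, with the stray dots removed for readability.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe loaded from the CSV.
dataframe = pd.DataFrame({
    'string1': ['A', 'A', 'A'],
    'string2': ['C', 'B', 'C'],
    'rate': [1.0, 2.0, 2.0],
    'distance': [20, 30, 20],
})

group_sizes = {}
for key, rows in dataframe.groupby(['string1', 'string2']):
    # key is a tuple such as ('A', 'C'); rows is the matching sub-dataframe
    group_sizes[key] = len(rows)
    print(key, len(rows))
```

Each distinct (string1, string2) pair shows up exactly once as a key, which is the "distinct tuples" part of the question.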

Then apply a custom function to these rows to produce new rows:

def add_average_velocity(input_rows):
    input_rows['avg_velocity'] = (input_rows['rate']/input_rows['distance']).mean()
    return input_rows

new_dataframe = dataframe.groupby(['string1','string2']).apply(add_average_velocity).reset_index()
print(new_dataframe)
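How `apply` treats the grouping columns and the resulting index can differ between pandas versions, so if the snippet above misbehaves, one alternative (not the approach used in this answer) is `transform`, which broadcasts each group's mean back onto its rows. A minimal sketch, assuming the same column names and a hypothetical cleaned-up copy of the sample data:

```python
import pandas as pd

# Hypothetical stand-in for the dataframe loaded from the CSV.
dataframe = pd.DataFrame({
    'string1': ['A', 'A', 'A'],
    'string2': ['C', 'B', 'C'],
    'rate': [1.0, 2.0, 2.0],
    'distance': [20, 30, 20],
})

# Per-row ratio, then the group mean written back onto every row of the group.
ratio = dataframe['rate'] / dataframe['distance']
dataframe['avg_velocity'] = ratio.groupby(
    [dataframe['string1'], dataframe['string2']]
).transform('mean')
print(dataframe)
```

This keeps every original row and adds one column, so no `reset_index()` is needed.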

Or, if you want to drop all the old data and keep only the new values:

def add_average_velocity(input_rows):
    output_data = pd.Series({'velocity':(input_rows['rate']/input_rows['distance']).mean()})
    # you can skip making a pd.Series objects if you are okay with having the data unnamed in resulting dataframe. You can always rename columns later anyway.
    return output_data

new_dataframe = dataframe.groupby(['string1','string2']).apply(add_average_velocity).reset_index()
print(new_dataframe)

Here you go:

=^..^=

import pandas as pd
from io import StringIO

# create raw data
raw_data = StringIO("""
string1 string2 rate distance
A. C. 1 20
A. B 2. 30
A. C. 2. 20""")

# load data into data frame
df = pd.read_csv(raw_data, sep=' ')
# compute rate / distance for every row
df['divide'] = df['rate'] / df['distance']
# drop the columns that are no longer needed
df = df.drop(columns=['rate','distance'])
# group by columns and average the values
result = df.groupby(['string1', 'string2']).mean()
print(result)

Output:

string1 string2          
A.      B        0.066667
        C.       0.075000
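For comparison with the dictionary approach started in the question, the same averages can be computed with only the standard library. This is a minimal sketch, assuming a comma-separated file with a header row; the in-memory `sample` buffer and the `totals` and `averages` names are illustrative, not from the original code.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical comma-separated version of the sample data.
sample = StringIO(
    "string1,string2,rate,distance\n"
    "A,C,1,20\n"
    "A,B,2,30\n"
    "A,C,2,20\n"
)

reader = csv.reader(sample)
header = next(reader)  # skip the header row

# (string1, string2) -> [running sum of rate/distance, row count]
totals = defaultdict(lambda: [0.0, 0])
for string1, string2, rate, distance in reader:
    entry = totals[(string1, string2)]
    entry[0] += float(rate) / float(distance)
    entry[1] += 1

# Divide each running sum by its count to get the per-tuple average.
averages = {key: total / count for key, (total, count) in totals.items()}
for (string1, string2), average in averages.items():
    print(string1, string2, average)
```

Keeping a running sum and count per tuple avoids holding all rows of a group in memory, which may matter when each tuple occurs thousands of times.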
