How do I compute the average of numbers across multiple CSV files?

2 votes

4 answers

3307 views

Asked 2025-04-18 13:26

I have a set of files containing replicate data from my simulation runs. Each file has the following layout:

generation, ratio_of_player_A, ratio_of_player_B, ratio_of_player_C

So the data looks like this:

0, 0.33, 0.33, 0.33

1, 0.40, 0.40, 0.20

2, 0.50, 0.40, 0.10

etc

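Because I ran the experiment many times, I have about 1000 such files, each holding the data from a different run. My question is how to average these data to get the mean over all runs.

In other words, I want a single file containing the average ratios for each generation, where the average is taken across the replicate runs (that is, across the files).

The output files to be averaged are all named output1.csv, output2.csv, output3.csv, and so on up to output1000.csv.

I would be very grateful if someone could help me with a shell script or a Python script.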

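Tags: shell script, data analysis, CSV processing, average calculation, replicate data, output file management, experimental data, batch file processing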

4 Answers

Your question isn't very clear, but if I understand it correctly:

# Start with an empty temp file, then append every CSV to it
> temp
for i in *.csv; do
    cat "$i" >> temp
done

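That puts all the data from the different files into one big file. You could then load it into an SQLite database (1. create a table, 2. insert the data). After that you can query your data like this:
select sum(columns)/count(columns) from yourtablehavingtempdata, and so on.
Consider SQLite because your data is tabular, so I think it is a better fit.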
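
To make the SQLite idea concrete, here is a minimal sketch in Python using only the standard library. It assumes the files have no header row and exactly the four columns from the question; the table name `runs`, the column names `a`, `b`, `c`, and the function name are hypothetical. It uses `AVG` with `GROUP BY generation` so that each generation gets its own average instead of one number for the whole table.

```python
import csv
import sqlite3


def average_by_generation(paths):
    """Load every CSV into one SQLite table and average each ratio per generation."""
    conn = sqlite3.connect(":memory:")  # in-memory database; nothing is written to disk
    conn.execute("CREATE TABLE runs (generation INTEGER, a REAL, b REAL, c REAL)")
    for path in paths:
        with open(path, newline="") as handle:
            # skipinitialspace handles the ", " separators shown in the question
            rows = [
                (int(r[0]), float(r[1]), float(r[2]), float(r[3]))
                for r in csv.reader(handle, skipinitialspace=True)
                if r  # skip blank lines
            ]
        conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT generation, AVG(a), AVG(b), AVG(c) "
        "FROM runs GROUP BY generation ORDER BY generation"
    )
    result = cursor.fetchall()
    conn.close()
    return result
```

Calling `average_by_generation(["output1.csv", "output2.csv"])` would return one `(generation, mean_a, mean_b, mean_c)` tuple per generation.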

Answered 2025-04-18 by Python大师

You can load all 1000 experiments into data frames, add them together, and then compute the average.

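import tkinter.filedialog
import pandas as pd

# Select the CSV files to average
filepaths = tkinter.filedialog.askopenfilenames(filetypes=[('CSV', '*.csv')])
dfs = []
for file in filepaths:
    # The question's files are comma-separated and appear to have no header row
    dfs.append(pd.read_csv(file, header=None, skipinitialspace=True))

temp = dfs[0]  # running sum, starting with the first data frame
for i in range(1, len(dfs)):  # start from 1 because dfs[0] is already in temp
    temp = temp + dfs[i]
result = temp / len(dfs)  # element-wise mean across all files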

Answered 2025-04-18 by Python大师

The following code should work:

from numpy import genfromtxt

files = ["file1", "file2", ...]

data = genfromtxt(files[0], delimiter=',')
for f in files[1:]:
    data += genfromtxt(f, delimiter=',')

data /= len(files)
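
A variant of the same idea, as a rough sketch: stack all files into one 3-D array and take the mean along the file axis, then write the result out as the single averaged file the question asks for. It assumes every file has the same number of rows and columns and no header row; the function names and the default output name `average.csv` are hypothetical.

```python
import numpy as np


def average_runs(paths):
    """Return the element-wise mean of equally shaped comma-separated files."""
    # One 2-D array per file, stacked into shape (n_files, n_rows, n_columns)
    stacked = np.stack([np.genfromtxt(p, delimiter=",") for p in paths])
    return stacked.mean(axis=0)  # average over the file axis


def save_average(paths, out_path="average.csv"):
    """Write the averaged table to out_path (a hypothetical file name)."""
    mean = average_runs(paths)
    # Column 0 is the generation counter, which averages to itself
    np.savetxt(out_path, mean, delimiter=",", fmt="%.6g")
    return mean
```

For the naming scheme in the question, the list of paths could be built with `[f"output{i}.csv" for i in range(1, 1001)]` and passed to `save_average`.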

Answered 2025-04-18 by Python大师

If I understand correctly, suppose you have two files like these:

$ cat file1
0, 0.33, 0.33, 0.33
1, 0.40, 0.40, 0.20
2, 0.50, 0.40, 0.10

$ cat file2
0, 0.99, 1, 0.02
1, 0.10, 0.90, 0.90
2, 0.30, 0.10, 0.30

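You want to compute the average of a given column across the two files. Here is one way to do it for the first column:

Edit: I found a better way, using pd.concat: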

all_files = pd.concat([file1,file2]) # you can easily put your 1000 files here
result = {}
for i in range(3): # 3 being number of generations
    result[i] = all_files[i::3].mean()
result_df = pd.DataFrame(result)
result_df
                       0     1     2
ratio_of_player_A  0.660  0.25  0.40
ratio_of_player_B  0.665  0.65  0.25
ratio_of_player_C  0.175  0.55  0.20
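
The slicing with a fixed step only works when every file has the same generations in the same order. As a hedged alternative, grouping the concatenated frame by its index gives the same per-generation means without that requirement. This sketch assumes headerless files with the four columns from the question; the function name is hypothetical.

```python
import pandas as pd

NAMES = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]


def mean_per_generation(paths):
    """Concatenate all runs and average rows that share a generation label."""
    frames = [
        pd.read_csv(p, names=NAMES, index_col=0, skipinitialspace=True)
        for p in paths
    ]
    all_files = pd.concat(frames)
    # Group by the index (generation) instead of slicing with a fixed step
    return all_files.groupby(level=0).mean()
```

The result has one row per generation and one column per player, so calling `.to_csv(...)` on it would produce the single averaged file.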

Another way is to use merge, although this requires merging multiple times.

import pandas as pd

In [1]: names = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]
In [2]: file1 = pd.read_csv("file1", index_col=0, names=names)
In [3]: file2 = pd.read_csv("file2", index_col=0, names=names)
In [3]: file1
Out[3]: 
            ratio_of_player_A  ratio_of_player_B  ratio_of_player_C
generation                                                         
0                        0.33               0.33               0.33
1                        0.40               0.40               0.20
2                        0.50               0.40               0.10

In [4]: file2
Out[4]: 
            ratio_of_player_A  ratio_of_player_B  ratio_of_player_C
generation                                                         
0                        0.99                1.0               0.02
1                        0.10                0.9               0.90
2                        0.30                0.1               0.30



In [5]: merged_file = file1.merge(file2, right_index=True, left_index=True, suffixes=["_1","_2"])
In [6]: merged_file.filter(regex="ratio_of_player_A_*").mean(axis=1)
Out[6]:
generation
0             0.66
1             0.25
2             0.40
dtype: float64

Or you can do it this way (I think it is a bit faster):

merged_file.iloc[:, ::3].mean(axis=1) # player A (.ix was removed in pandas 1.0; .iloc selects by position)

If you have many files, you can merge them recursively and then apply the mean() method.
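
One possible way to do the repeated merge, sketched with `functools.reduce` (a pairwise fold rather than true recursion). Each frame's columns get a numeric suffix first so the names stay unique; `merge_all` and `player_mean` are hypothetical helper names, and the frames are assumed to be indexed by generation as above.

```python
from functools import reduce

import pandas as pd


def merge_all(frames):
    """Merge frames on their index, tagging each frame's columns with its position."""
    tagged = [f.add_suffix(f"_{i}") for i, f in enumerate(frames, start=1)]
    return reduce(
        lambda left, right: left.merge(right, left_index=True, right_index=True),
        tagged,
    )


def player_mean(merged, column):
    """Average one player's columns (e.g. 'ratio_of_player_A') across all runs."""
    return merged.filter(regex=f"^{column}_\\d+$").mean(axis=1)
```

With 1000 files the merged frame gets 3000 columns, so for large numbers of runs the pd.concat approach is probably the simpler choice.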

If I have misunderstood the question, please tell us what you want to get from file1 and file2.

If anything is unclear, feel free to ask.

Hope this helps!

Answered 2025-04-18 by Python大师
