Extracting a few lines from a data file with Python

1 vote

5 answers

1727 views

Asked 2025-04-16 17:17

I have a very large data file, and roughly every 5000 lines I need to extract 3 lines. The file is formatted like this:

...

O_sh          9215    1.000000   -2.304400   
 -1.0680E+00  1.3617E+00 -5.7138E+00  
O_sh          9216    1.000000   -2.304400  
 -8.1186E-01 -1.7454E+00 -5.8169E+00  
timestep    501      9216         0         3    0.000500  
   20.54      -11.85       35.64      
  0.6224E-02   23.71       35.64      
  -20.54      -11.86       35.64      
Li               1    6.941000    0.843200
  3.7609E-02  1.1179E-01  4.1032E+00
Li               2    6.941000    0.843200
  6.6451E-02 -1.3648E-01  1.0918E+01

...

What I need are the three lines that follow each line beginning with "timestep". In this example, that is the 3x3 array:

   20.54      -11.85       35.64      
  0.6224E-02   23.71       35.64      
  -20.54      -11.86       35.64

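Every time "timestep" occurs, these three lines should be written to an output file.

I also need the average of all these arrays in a single array: each element is the mean of the elements at the same position across every array in the file.

I have been working on this for a while but have not yet managed to extract the right data.

Thanks very much, and this is not homework. Your advice will help the progress of science! =)

Thanks,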

Tags: file operations, data processing, data extraction, data analysis, data formats, array operations, averaging, timestep

5 Answers

OK, you can do it like this:

Algorithm:

Read the file line by line
if the line starts with "timestep":
    read the next three lines
    take the average as needed

Code:

def getArrays(f):
    # Running element-wise mean of every 3x3 block that follows a "timestep" line.
    answer = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    count = 0  # number of blocks averaged so far
    line = f.readline()
    while line:
        if line.strip().startswith("timestep"):
            # The three lines after "timestep" form one 3x3 block.
            rows = [getFloats(f.readline()) for _ in range(3)]
            for i in range(3):
                for j in range(3):
                    # Incremental mean: new = (old * n + x) / (n + 1)
                    answer[i][j] = (answer[i][j] * count + rows[i][j]) / (count + 1)
            # Count blocks, not lines, so the weights in the mean are correct.
            count += 1
        line = f.readline()
    return answer

def getFloats(line):
    # float() already parses scientific notation such as "0.6224E-02",
    # so no manual handling of the exponent is needed.
    return [float(num) for num in line.split()]

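Here, answer is a single 3x3 list that is meant to hold the element-wise average over all the 3x3 blocks in the file. If you want a different kind of average, let me know and I can add it to the algorithm; otherwise you can write a function that post-processes this result.

Hope this helps.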
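As a minimal sketch of the two ideas this answer relies on (not part of the original answer): the helper name `incremental_mean` is hypothetical, and the sample numbers are only there for illustration. It shows the one-value-at-a-time mean update, and that Python's built-in `float()` reads the exponent notation used in the data file directly.

```python
def incremental_mean(values):
    """Return the mean of `values`, folding in one item at a time."""
    mean = 0.0
    for n, x in enumerate(values):
        # n items are already folded into `mean`; fold in item number n + 1
        mean = (mean * n + x) / (n + 1)
    return mean

# float() accepts exponent notation like the tokens in the data file
samples = [float(tok) for tok in "0.6224E-02 -1.0680E+00 23.71".split()]
print(samples)
print(incremental_mean([1.0, 2.0, 3.0, 4.0]))
```

The update only needs the previous mean and the number of items seen so far, which is why the file can be processed in a single pass without keeping every block in memory.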

Answered 2025-04-16 by Python大师


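Assuming this is not homework, I think regular expressions are overkill for this problem. If you know you need the three lines that follow a line starting with 'timestep', why not do it like this: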

Matrices = []

with open('data.txt') as fh:
  for line in fh:
    # If we see timestep put the next three lines in our Matrices list.
    if line.startswith('timestep'):
      Matrices.append([next(fh) for _ in range(3)])

As suggested in the comments, pull the three following lines with next(fh) so that the file handle stays in sync with the loop. Thanks!
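Building on the same next()-based idea, here is a rough sketch of how the blocks could also be converted to floats, written to an output file, and averaged element by element, as the question asks. The names `read_blocks` and `extract_and_average` are hypothetical, the output format is an arbitrary choice, and the second block in the demo data is invented purely so there is something to average.

```python
import io

def read_blocks(lines):
    """Yield each 3x3 block (3 lists of 3 floats) that follows a 'timestep' line."""
    it = iter(lines)
    for line in it:
        if line.startswith("timestep"):
            rows = [next(it, None) for _ in range(3)]
            if None in rows:
                return  # truncated input: ignore the incomplete final block
            yield [[float(tok) for tok in row.split()] for row in rows]

def extract_and_average(src, dst):
    """Write every block to dst; return the element-wise mean (None if no blocks)."""
    sums = [[0.0] * 3 for _ in range(3)]
    count = 0
    for block in read_blocks(src):
        for i, row in enumerate(block):
            dst.write(" ".join(f"{v:.6e}" for v in row) + "\n")
            for j, v in enumerate(row):
                sums[i][j] += v
        dst.write("\n")  # blank line between blocks
        count += 1
    if count == 0:
        return None
    return [[s / count for s in row] for row in sums]

# Demo on in-memory data; the second block is made up for illustration.
sample = io.StringIO(
    "timestep    501      9216         0         3    0.000500\n"
    "   20.54      -11.85       35.64\n"
    "  0.6224E-02   23.71       35.64\n"
    "  -20.54      -11.86       35.64\n"
    "Li               1    6.941000    0.843200\n"
    "  3.7609E-02  1.1179E-01  4.1032E+00\n"
    "timestep    502      9216         0         3    0.000500\n"
    "   20.56      -11.85       35.64\n"
    "  0.6224E-02   23.73       35.64\n"
    "  -20.56      -11.86       35.64\n"
)
out = io.StringIO()
mean = extract_and_average(sample, out)
print(mean)
```

With real files, the two StringIO objects would be replaced by something like `open('data.txt')` and `open('timesteps.txt', 'w')` (the output name is an assumption). Because only running sums are kept, memory use stays constant regardless of file size.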

Answered 2025-04-16 by Python大师


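I suggest using a coroutine (if you are not familiar with them, think of one as a generator that can also receive values), so that you can maintain a running average while iterating over the file.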

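def running_avg():
    # Coroutine: send() it a number and it yields the mean of all numbers sent so far.
    count, total = 0, 0.0
    value = yield None
    while True:
        # A plain next() sends None, which must be skipped;
        # 0.0 is a valid sample and must still be counted.
        if value is not None:
            total += value
            count += 1
        # Avoid dividing by zero if queried before any value was sent.
        value = yield (total / count if count else None)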

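# array for keeping running average
array = [[running_avg() for y in range(3)] for x in range(3)]

# advance each coroutine to its first yield before sending values
# (the built-in next(elem) works on Python 2.6+ and 3; elem.next() is Python 2 only)
for row in array:
    for elem in row:
        next(elem)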

with open('data.txt') as f:
    idx = None
    for line in f:
        if idx is not None and idx < 3:
            for i, elem in enumerate(line.strip().split()):
                array[idx][i].send(float(elem))
            idx += 1
        if line.startswith('timestep'):
            idx = 0

To turn array into a list of averages, advance each coroutine once more without sending it a value; it yields the current average:

averages = [[next(elem) for elem in row] for row in array]

You will then get a result along these lines:

averages = [[20.54, -11.85, 35.64], [0.006224, 23.71, 35.64], [-20.54, -11.86, 35.64]]
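For readers new to coroutines, here is a small standalone Python 3 walk-through of the prime, send, and query steps used above. It is only a sketch: the coroutine is restated so the snippet runs on its own, the variable names `first`, `second`, `third`, and `current` are hypothetical, and the sample values are arbitrary.

```python
def running_avg():
    """Coroutine: send() a number, receive the mean of everything sent so far."""
    total, count = 0.0, 0
    value = yield None              # priming stop; nothing to report yet
    while True:
        if value is not None:       # next() sends None, which is not a sample
            total += value
            count += 1
        value = yield (total / count if count else None)

avg = running_avg()
next(avg)                # prime: run the coroutine up to its first yield
first = avg.send(1.0)    # each send() returns the updated mean
second = avg.send(0.0)   # a zero is a real sample and is counted
third = avg.send(5.0)
current = next(avg)      # query the mean without adding a sample
print(first, second, third, current)
```

The priming call matters because a freshly created generator has not reached a yield yet, and sending a non-None value to it raises a TypeError.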

Answered 2025-04-16 by Python大师
