Importing a text file in Python where each line has a different number of columns

Posted 2024-06-07 02:59:38


I'm new to Python and I'm trying to figure out how to load a data file that contains blocks of data, for example:

TIME:,0
Q01 : A:,-10.7436,0.000536907,-0.00963283,0.00102934
Q02 : B:,0,0.0168694,-0.000413983,0.00345921
Q03 : C:,0.0566665
Q04 : D:,0.074456
Q05 : E:,0.077456
Q06 : F:,0.0744835
Q07 : G:,0.140448
Q08 : H:,-0.123968
Q09 : I:,0
Q10 : J:,0.00204377,0.0109621,-0.0539183,0.000708574
Q11 : K:,-2.86115e-17,0.00947104,0.0145645,1.05458e-16,-1.90972e-17,-0.00947859
Q12 : L:,-0.0036781,0.00161254
Q13 : M:,-0.00941257,0.000249692,-0.0046302,-0.00162387,0.000981709,-0.0135982,-0.0223496,-0.00872062,0.00548815,0.0114075,.........,-0.00196206
Q14 : N:,3797, 66558
Q15 : O:,0.0579981
Q16 : P:,0
Q17 : Q:,625

TIME:,0.1
Q01 : A:,-10.563,0.000636907,-0.00963283,0.00102934
Q02 : B:,0,0.01665694
Q03 : C:,0.786,-0.000666,0.6555
Q04 : D:,0.87,0.96
Q05 : E:,0.077456
Q06 : F:,0.07447835
Q07 : G:,0.140448
Q08 : H:,-0.123968
Q09 : I:,0
Q10 : J:,0.00204377,0.0109621,-0.0539183,0.000708574
Q11 : K:,-2.86115e-17,0.00947104,0.0145645,1.05458e-16,-1.90972e-17,-0.00947859
Q12 : L:,-0.0036781,0.00161254
Q13 : M:,-0.00941257,0.000249692,-0.0046302,-0.00162387,0.000981709,-0.0135982,-0.0223496,-0.00872062,0.00548815,0.0114075,.........,-0.00196206
Q14 : N:,3797, 66558
Q15 : O:,0.0579981
Q16 : P:,0,2,4
Q17 : Q:,786

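Each block contains a number of variables, and the number of data columns per variable can vary widely. The number of columns for a given variable may change from one timestep block to the next, but every timestep block contains the same number of variables, and the number of exported variables is always known. The data file contains no information about the number of blocks (timesteps).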

Once read, the data should be loaded in a variable/timestep layout:

[example layout missing from the source]

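If every timestep had the same number of data columns and every variable had the same number of columns, this would be a very simple problem.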

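I think I need to read the file line by line in two loops, one over the blocks and one over the lines within each block, and then store the input in an array (append?). The varying number of columns per line is what confuses me, since I'm not yet very familiar with Python and numpy.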

If someone could point me in the right direction, such as which functions I should use to do this relatively efficiently, that would be great.


3 Answers
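import pandas as pd

res = {}      # {time: {variable: [values as strings]}}
TIME = None   # current time step; set when a TIME line is seen

# Iterating over the file object reads it lazily, one line at a time.
with open('file.txt') as f:
    for line in f:
        # 'Q01 : A:,1,2' -> ['Q01', 'A', ',1,2']
        parts = [p.strip() for p in line.strip().split(':')]
        if parts[0] == 'TIME':
            TIME = parts[1].strip(',')
            res[TIME] = {}
            print('New time section start {}'.format(TIME))
            # Here you can stop and work with the data from the previous period.
            continue

        # Skip blank or malformed lines, and any data before the first TIME line.
        if len(parts) < 3 or TIME is None:
            continue
        res[TIME][parts[1]] = [v.strip() for v in parts[2].strip(',').split(',')]

# Columns are time steps, rows are variables.
df = pd.DataFrame.from_dict(res, 'columns')
# For example, for TIME 0:
dfZero = df['0']
print(dfZero)

# Rows are time steps, columns are variables.
df = pd.DataFrame.from_dict(res, 'index')

dfA = df['A']
print(dfA)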
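
The res dictionary built above holds every value as a string. Since the question mentions numpy, here is a minimal sketch of one way to turn each variable into a 1-D float array while keeping the ragged lengths. The sample dictionary and the helper name to_arrays are hypothetical and only mirror the shape of res; real data containing elided values such as "........." would need extra handling.

```python
import numpy as np

# Hypothetical sample with the same shape as `res`: {time: {variable: [str, ...]}}
res = {
    '0':   {'A': ['-10.7436', '0.000536907'], 'C': ['0.0566665']},
    '0.1': {'A': ['-10.563', '0.000636907'], 'C': ['0.786', '-0.000666', '0.6555']},
}

def to_arrays(res):
    """Convert every list of numeric strings into a 1-D float array."""
    return {
        time: {name: np.array([float(v) for v in values])
               for name, values in block.items()}
        for time, block in res.items()
    }

arrays = to_arrays(res)
# Each variable keeps its own length, so 'C' has 1 value at time 0 and 3 at time 0.1.
print(arrays['0']['C'].shape, arrays['0.1']['C'].shape)
```

Keeping a dictionary of separate arrays avoids forcing ragged rows into one rectangular array, which would require padding.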


File test.csv:

1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4

Processing the data:

[code listing missing from the source]

Output:

   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN

Alternatively, you can use the names parameter.

For example:

1,2,1
2,3,4,2,3
1,2,3,3
1,2,3,4,5,6

If you read it as-is, you get the following error:

>>> pd.read_csv(r'D:/Temp/test.csv')
Traceback (most recent call last):
...
Expected 5 fields in line 4, saw 6

But if you pass the names parameter, you get a result:

>>> pd.read_csv(r'D:/Temp/test.csv', names=list('ABCDEF'))

Output:

   A  B  C   D   E   F
0  1  2  1 NaN NaN NaN
1  2  3  4   2   3 NaN
2  1  2  3   3 NaN NaN
3  1  2  3   4   5   6

Hope this helps.
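
Passing names requires knowing the widest row in advance. When that is not known, one option is a quick first pass to measure it. The sketch below assumes the data fits in memory and uses an in-memory string in place of the file; the helper name read_ragged_csv and the integer column names are illustrative choices, not part of the answer above.

```python
import csv
import io

import pandas as pd

# Same rows as the second example above, held in memory so the sketch is self-contained.
RAGGED = "1,2,1\n2,3,4,2,3\n1,2,3,3\n1,2,3,4,5,6\n"

def read_ragged_csv(text):
    # First pass: find the widest row.
    width = max(len(row) for row in csv.reader(io.StringIO(text)))
    # Second pass: with explicit names, pandas pads shorter rows with NaN.
    return pd.read_csv(io.StringIO(text), header=None, names=list(range(width)))

df = read_ragged_csv(RAGGED)
print(df)
```

For a file on disk, the same two passes can be made by opening the file twice instead of wrapping a string in io.StringIO.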

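A very simple way to do this is to read the text file and build a dict structure as you scan it. Here is an example that may achieve what you're after (based on the input you provided):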

time = None
output = {}
with open('path_to_file', 'r') as input_file:
    for line in input_file:
        line = line.strip()
        if not line:
            # Skip the blank lines that separate blocks.
            continue
        if line.startswith('TIME'):
            # 'TIME:,0.1' -> '0.1'
            time = line.split(',')[1]
            output[time] = {}
        else:
            # 'Q01 : A:,1,2' -> name 'A', value '1,2'
            col_name = line.split(':')[1].strip()
            col_value = line.split(':')[2].strip(',')
            output[time][col_name] = col_value

This gives an output object, a dictionary with the following structure:

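{
    '0':   {'A': '-10.7436,0.000536907,-0.00963283,0.00102934', 'B': '0,0.0168694,-0.000413983,0.00345921', ...},
    '0.1': {'A': '-10.563,0.000636907,-0.00963283,0.00102934', 'B': '0,0.01665694', ...},
}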

I think this matches what you're looking for. To access a value in this dictionary, use value = output['0.1']['A'], which yields '-10.563,0.000636907,-0.00963283,0.00102934'.
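
The question asks for variable/timestep access, while the output dictionary is keyed by timestep first and stores each value as one comma-separated string. A minimal sketch of inverting the nesting and parsing the numbers might look like the following. The sample dictionary is a hypothetical excerpt shaped like output, and the helper name by_variable is made up for illustration.

```python
# Hypothetical excerpt shaped like the `output` dict described above.
output = {
    '0':   {'A': '-10.7436,0.000536907,-0.00963283,0.00102934',
            'C': '0.0566665'},
    '0.1': {'A': '-10.563,0.000636907,-0.00963283,0.00102934',
            'C': '0.786,-0.000666,0.6555'},
}

def by_variable(output):
    """Invert {time: {variable: 'v1,v2'}} into {variable: {time: [v1, v2]}}."""
    result = {}
    for time, block in output.items():
        for name, text in block.items():
            # float() tolerates surrounding spaces, e.g. ' 66558'.
            values = [float(v) for v in text.split(',') if v.strip()]
            result.setdefault(name, {})[time] = values
    return result

data = by_variable(output)
print(data['C']['0.1'])
```

With this layout, data['A'] holds every timestep for variable A, and each timestep keeps its own number of columns.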
