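Python - time-oriented CSV: transposing a large number of columns into rows
I have many CSV files in a "column" format that I need to pre-process before finally indexing them.
The data is time-ordered, with one column per "device" (up to 128 columns), for example: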
LDEV_XXXXXX.csv
Serial number : XXXXX(VSP)
From : 2014/06/04 05:58
To : 2014/06/05 05:58
sampling rate : 1
"No.","time","00:30:00X(X2497-1)","00:30:01X(X2498-1)","00:30:02X(X2499-1)"
"242","2014/06/04 10:00",0,0,0
"243","2014/06/04 10:01",0,0,0
"244","2014/06/04 10:02",9,0,0
"245","2014/06/04 10:03",0,0,0
"246","2014/06/04 10:04",0,0,0
"247","2014/06/04 10:05",0,0,0
My goal is to transpose the data into rows (if that is the right term) so that I can process it more efficiently, for example:
"time",device,value
"2014/06/04 10:00","00:30:00X(X2497-1)",0
"2014/06/04 10:00","00:30:01X(X2498-1)",0
"2014/06/04 10:00","00:30:02X(X2499-1)",0
"2014/06/04 10:01","00:30:00X(X2497-1)",0
"2014/06/04 10:01","00:30:01X(X2498-1)",0
"2014/06/04 10:01","00:30:02X(X2499-1)",0
"2014/06/04 10:02","00:30:00X(X2497-1)",9
"2014/06/04 10:02","00:30:01X(X2498-1)",0
"2014/06/04 10:02","00:30:02X(X2499-1)",0
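And so on...
Note: I have left the raw data as it is (using "," as the separator). You will notice that I need to drop the first six lines and the "No." column, which is of no interest, but that is neither the main goal nor the difficulty.
I have some initial Python code that transposes CSV data, but it does not do exactly what I need...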
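import csv
import sys

infile = sys.argv[1]
outfile = sys.argv[2]

# Read every row of the input file into memory
with open(infile) as f:
    reader = csv.reader(f)
    cols = []
    for row in reader:
        cols.append(row)

# 'wb' is the Python 2 idiom; on Python 3 use open(outfile, 'w', newline='')
with open(outfile, 'wb') as f:
    writer = csv.writer(f)
    # Plain transpose: input row i becomes output column i
    for i in range(len(max(cols, key=len))):
        writer.writerow([(c[i] if i < len(c) else '') for c in cols])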
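Note that the number of columns is arbitrary: some files have only a few, others have up to 128.
I am fairly sure this is a common need, but I have not yet found Python code that does exactly this, or I could not get it to work...
Edit:
In more detail:
Each timestamp row is repeated once per device, so the file gets many more lines (multiplied by the number of devices) but only a few columns (timestamp, device, value). The desired result has been updated :-)
Edit:
I would like to be able to run the script with the input file as argument 1 and the output file as argument 2 :-)
2 Answers
First, transform the data into the structure you want; writing it back out then becomes much simpler. Also, for CSV files with a more complex structure, it is usually more convenient to open them with DictReader.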
from csv import DictReader, DictWriter

# csv_path is the input file; DictReader takes its first line as the header
with open(csv_path) as f:
    table = list(DictReader(f, restval=''))

transformed = []
for row in table:
    # Python 2; on Python 3 use row.keys() instead of row.viewkeys()
    devices = [d for d in row.viewkeys() - {'time', 'No.'}]
    time_rows = [{'time': row['time']} for i in range(len(devices))]
    for i, d in enumerate(devices):
        time_rows[i].update({'device': d, 'value': row[d]})
    transformed += time_rows
This produces a list like:
[{'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:00'},
{'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:00'},
{'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:00'},
{'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:01'},
{'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:01'},
{'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:01'},
{'device': '00:30:00X(X2497-1)', 'value': '9', 'time': '2014/06/04 10:02'},
{'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:02'},
{'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:02'},
{'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:03'},
{'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:03'},
{'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:03'},
{'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:04'},
{'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:04'},
{'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:04'},
{'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:05'},
{'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:05'},
{'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:05'}]
That is exactly what we want. To write the data back out, you can then use DictWriter.
# you might sort transformed here so that it gets written out in whatever order you like
column_names = ['time', 'device', 'value']
with open(out_path, 'w') as f:
    writer = DictWriter(f, column_names)
    writer.writeheader()
    writer.writerows(transformed)
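For Python 3, and for files that still carry the preamble lines above the header, the same idea might look like the sketch below. The names melt_rows, convert and main are made up for this illustration, and it assumes the header is the first line starting with "No.". It streams the rows instead of building the whole list in memory, and keeps the devices in header order.

```python
import csv


def melt_rows(lines):
    """Yield (time, device, value) tuples from an iterable of CSV lines.

    Assumption: everything before the first line starting with '"No."'
    is preamble and can be ignored.
    """
    lines = iter(lines)
    for line in lines:
        if line.lstrip().startswith('"No."'):
            header = next(csv.reader([line]))
            break
    else:
        raise ValueError('header line starting with "No." not found')

    devices = header[2:]          # skip the "No." and "time" columns
    for row in csv.reader(lines):
        if not row:               # ignore blank lines
            continue
        for device, value in zip(devices, row[2:]):
            yield row[1], device, value


def convert(in_path, out_path):
    """Read the wide file at in_path and write the long file at out_path."""
    with open(in_path, newline='') as fin, \
            open(out_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        writer.writerow(['time', 'device', 'value'])
        writer.writerows(melt_rows(fin))


def main(argv):
    """Expects argv == [script, input_file, output_file]."""
    if len(argv) != 3:
        raise SystemExit('usage: %s INPUT.csv OUTPUT.csv' % argv[0])
    convert(argv[1], argv[2])

# In a script: import sys, then call main(sys.argv)
```

With main(sys.argv) at the bottom of a script, this would take the input file as argument 1 and the output file as argument 2. Note that csv.writer uses minimal quoting by default, so the time and device fields are written without the surrounding double quotes.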
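Edit: added quotes (") around No., converted the code to Python 2 with a note on Python 3 usage, and removed a debugging print.
Edit 2: fixed a silly bug where the index was not incremented.
Edit 3: new version that allows the input file to contain several headers, each followed by data.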
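I am not sure the csv module is worth using here, because your separator is fixed and no field contains embedded quotes, newlines or separator characters: line.strip().split(',') is enough.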
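One difference worth keeping in mind: a plain split leaves the surrounding double quotes in each field, whereas the csv module interprets and removes them. A small illustration, using one line from the question:

```python
import csv

line = '"244","2014/06/04 10:02",9,0,0\n'

# Plain split: the quotes stay part of each field.
split_fields = line.strip().split(',')

# csv.reader: the quotes are interpreted and removed.
csv_fields = next(csv.reader([line]))

print(split_fields)
print(csv_fields)
```

Because the split-based code prints the fields exactly as it read them, the quotes around the time and the device name carry over to the output unchanged.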
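Here is what I would do:
- Skip lines until one starting with "No." is found, then read everything after its first two fields to get the identifiers.
- Process the remaining lines one by one:
- take the date from the second field;
- print the content of each field with its identifier, ignoring the first two fields.
Code for Python 2 (for Python 3, just remove the first line, from __future__ import print_function):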
from __future__ import print_function

class transposer(object):
    def _skip_preamble(self):
        # Skip lines until the header, then keep the device identifiers
        for line in self.fin:
            if line.strip().startswith('"No."'):
                self.keys = line.strip().split(',')[2:]
                return
        raise Exception('Initial line not found')

    def _do_loop(self):
        for line in self.fin:
            elts = line.strip().split(',')
            dat = elts[1]  # the timestamp is the second field
            ix = 0
            for val in elts[2:]:
                print(dat, self.keys[ix], val, sep=',', file=self.out)
                ix += 1

    def transpose(self, ficin, ficout):
        with open(ficin) as fin:
            with open(ficout, 'w') as fout:
                self.do_transpose(fin, fout)

    def do_transpose(self, fin, fout):
        self.fin = fin
        self.out = fout
        self._skip_preamble()
        self._do_loop()
Usage:
t = transposer()
t.transpose('in', 'out')
If the input file contains several headers, the list of keys has to be reset at each header:
from __future__ import print_function

class transposer(object):
    def _do_loop(self):
        line_number = 0
        for line in self.fin:
            line_number += 1
            line = line.strip()
            if line.startswith('"No."'):
                # A header line: reset the device identifiers
                self.keys = line.split(',')[2:]
            elif line.startswith('"'):
                # A data line: must have one value per identifier
                elts = line.split(',')
                if len(elts) == (len(self.keys) + 2):
                    dat = elts[1]
                    ix = 0
                    for val in elts[2:]:
                        print(dat, self.keys[ix], val, sep=',', file=self.out)
                        ix += 1
                else:
                    raise Exception("Syntax error line %d expected %d values found %d"
                                    % (line_number, len(self.keys), len(elts) - 2))

    def transpose(self, ficin, ficout):
        with open(ficin) as fin:
            with open(ficout, 'w') as fout:
                self.do_transpose(fin, fout)

    def do_transpose(self, fin, fout):
        self.fin = fin
        self.out = fout
        self.keys = []
        self._do_loop()
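The multi-header logic can also be written as a generator, which makes it easy to try on in-memory data. This is only a sketch for Python 3: the name iter_long_lines and the sample lines are invented here, and it pairs keys and values with zip instead of a manual index.

```python
def iter_long_lines(lines):
    """Yield 'time,device,value' strings from wide-format lines.

    A line starting with '"No."' (re)defines the device keys; any other
    line starting with a double quote is treated as data. Everything else
    (preamble, blank lines) is skipped.
    """
    keys = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if line.startswith('"No."'):
            keys = line.split(',')[2:]
        elif line.startswith('"'):
            elts = line.split(',')
            if len(elts) != len(keys) + 2:
                raise ValueError(
                    'line %d: expected %d values, found %d'
                    % (line_number, len(keys), len(elts) - 2))
            for key, val in zip(keys, elts[2:]):
                yield ','.join((elts[1], key, val))


# Hypothetical sample with a preamble line and two header blocks
sample = [
    'From : 2014/06/04 05:58',
    '"No.","time","A","B"',
    '"1","2014/06/04 10:00",0,9',
    '"No.","time","C"',
    '"1","2014/06/04 10:00",5',
]
for out_line in iter_long_lines(sample):
    print(out_line)
```

Since it accepts any iterable of lines, the same function can be fed an open file object and its results written to the output file one line at a time.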