Python - transpose columns to rows during data manipulation, before writing to a file
I develop a publicly available open source application for Splunk (a Nmon performance monitor for Unix and Linux systems, see https://apps.splunk.com/app/1753/).
An important part of this application is an old perl script (recycled, modified and updated) that the application launches automatically to convert the Nmon data (a kind of custom csv format): it reads from standard input and writes the data, section by section, into formatted csv files (each section being one performance monitor).
I am now rewriting this script entirely in Python, and a first beta version is almost done... but I am stuck on one data transformation and I am afraid I cannot solve it alone.
So I am kindly asking for help today.
Here is the difficulty in detail:
Nmon generates performance data for a number of sections (cpu, memory, disks, and so on). For most of them, extracting the right timestamp and data is not very difficult. But for every section that involves the notion of a "device" (such as DISKBUSY in the example below, which gives the percentage of time each disk was busy), the data has to be transformed and transposed before it can be used.
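For reference, the raw lines of such a section in a nmon file look roughly like this (simplified here; the exact header wording varies, see the test nmon file linked at the end of this question): a header line listing the devices, periodic ZZZZ timestamp lines, and one data line per snapshot that refers to its timestamp through a Txxxx id:
DISKBUSY,Disk %Busy hostname,sda,sda1,sda2
ZZZZ,T0001,11:10:44,26-JUL-2014
DISKBUSY,T0001,4.4,0.0,0.0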
Currently, the data I am able to generate looks like this:
Example:
time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
The goal is to transpose the data so that the header becomes "time,device,value", for example:
time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0
And so on.
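For illustration only, the transposition itself can be expressed by zipping the device columns of the header with each data row. A minimal Python 2 sketch on a trimmed copy of the sample above (the variable names are mine):

sample = '''time,sda,sda1,sda2
26-JUL-2014 11:10:44,4.4,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0'''

rows = [line.split(',') for line in sample.splitlines()]
header = rows[0]                                  # ['time', 'sda', 'sda1', 'sda2']
for row in rows[1:]:
    time = row[0]                                 # first column is the timestamp
    for device, value in zip(header[1:], row[1:]):
        print time + ',' + device + ',' + value

This prints exactly the time,device,value rows listed above.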
A month ago I asked a question about an almost identical need (for another application; the data was not exactly the same, but it likewise required transposing columns into rows).
I got a very good answer that did the trick perfectly, but I have not managed to reuse that piece of code in this new context. The difference is that here I want to do the transposition inside the code itself, so that the script works purely in memory and avoids juggling multiple temporary files.
Here is the current code:
Note: Python 2.x is required.
###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################
dynamic_section = ["DISKBUSY"]
for section in dynamic_section:
# Set output file
currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'
# Open output for writing
with open(currsection_output, "w") as currsection:
for line in data:
# Extract sections, and write to output
myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
find_section = re.match( myregex, line)
if find_section:
# csv header
# Replace some symbols
line=re.sub("%",'_PCT',line)
line=re.sub(" ",'_',line)
# Extract header excluding data that always has Txxxx for timestamp reference
myregex = '(' + section + ')\,([^T].+)'
fullheader_match = re.search( myregex, line)
if fullheader_match:
fullheader = fullheader_match.group(2)
header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)
if header_match:
header = header_match.group(2)
# Write header
currsection.write('time' + ',' + header + '\n'),
# Extract timestamp
# Nmon V9 and prior do not have date in ZZZZ
# If unavailable, we'll use the global date (AAA,date)
ZZZZ_DATE = '-1'
ZZZZ_TIME = '-1'
# For Nmon V10 and more
timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
if timestamp_match:
ZZZZ_TIME = timestamp_match.group(2)
ZZZZ_DATE = timestamp_match.group(3)
ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME
# For Nmon V9 and less
if ZZZZ_DATE == '-1':
ZZZZ_DATE = DATE
timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
if timestamp_match:
ZZZZ_TIME = timestamp_match.group(2)
ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME
# Extract Data
myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
perfdata_match = re.match( myregex, line)
if perfdata_match:
perfdata = perfdata_match.group(2)
# Write perf data
currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),
# End for
# Open output for reading and show number of line we extracted
with open(currsection_output, "r") as currsection:
num_lines = sum(1 for line in currsection)
print (section + " section: Wrote", num_lines, "lines")
# End for
This line:
currsection.write('time' + ',' + header + '\n')
will contain the header, and this line:
currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n')
contains the data, line by line.
Note: the final data (header and body) should also contain other pieces of information that I removed from the code above for simplicity.
For the static sections, which do not require transposition, the code would be:
currsection.write('type' + ',' + 'serialnum' + ',' + 'hostname' + ',' + 'time' + ',' + header + '\n')
and:
currsection.write(section + ',' + SN + ',' + HOSTNAME + ',' + ZZZZ_timestamp + ',' + perfdata + '\n')
The final goal is to transpose the data as soon as the required pieces have been identified, just before writing.
Furthermore, performance and minimal system resource usage are a requirement (e.g. working in memory rather than through temporary files), to avoid generating too much CPU load when the script runs periodically.
Could anyone help me achieve this? I have looked again and again, and I am pretty sure there are several ways to do it (zip, map, dictionaries, lists, split...), but I have not managed to get it working.
Please be indulgent, this is my first real Python script :-)
Thanks a lot for any help!
More details:
- Test nmon file
A small test nmon file is available here: http://pastebin.com/xHLRbBU0
- Current complete script
The current complete script is available here: http://pastebin.com/QEnXj6Yh
To test the script, you need to:
Export the SPLUNK_HOME variable to whatever suits you, e.g.:
mkdir /tmp/nmon2csv
--> place the script and the nmon file there, and make the script executable
export SPLUNK_HOME=/tmp/nmon2csv
mkdir -p etc/apps/nmon
And finally:
cat test.nmon | ./nmon2csv.py
The data will be generated under /tmp/nmon2csv/etc/apps/nmon/var/*
Update: working code using the csv module:
###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################
dynamic_section = ["DISKBUSY","DISKBSIZE","DISKREAD","DISKWRITE","DISKXFER","DISKRIO","DISKWRIO","IOADAPT","NETERROR","NET","NETPACKET","JFSFILE","JFSINODE"]
for section in dynamic_section:
# Set output file (will opened after transpose)
currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'
# Open Temp
with TemporaryFile() as tempf:
for line in data:
# Extract sections, and write to output
myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
find_section = re.match( myregex, line)
if find_section:
# csv header
# Replace some symbols
line=re.sub("%",'_PCT',line)
line=re.sub(" ",'_',line)
# Extract header excluding data that always has Txxxx for timestamp reference
myregex = '(' + section + ')\,([^T].+)'
fullheader_match = re.search( myregex, line)
if fullheader_match:
fullheader = fullheader_match.group(2)
header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)
if header_match:
header = header_match.group(2)
# Write header
tempf.write('time' + ',' + header + '\n'),
# Extract timestamp
# Nmon V9 and prior do not have date in ZZZZ
# If unavailable, we'll use the global date (AAA,date)
ZZZZ_DATE = '-1'
ZZZZ_TIME = '-1'
# For Nmon V10 and more
timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
if timestamp_match:
ZZZZ_TIME = timestamp_match.group(2)
ZZZZ_DATE = timestamp_match.group(3)
ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME
# For Nmon V9 and less
if ZZZZ_DATE == '-1':
ZZZZ_DATE = DATE
timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
if timestamp_match:
ZZZZ_TIME = timestamp_match.group(2)
ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME
# Extract Data
myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
perfdata_match = re.match( myregex, line)
if perfdata_match:
perfdata = perfdata_match.group(2)
# Write perf data
tempf.write(ZZZZ_timestamp + ',' + perfdata + '\n'),
# Open final for writing
with open(currsection_output, "w") as currsection:
# Rewind temp
tempf.seek(0)
writer = csv.writer(currsection)
writer.writerow(['type', 'serialnum', 'hostname', 'time', 'device', 'value'])
for d in csv.DictReader(tempf):
time = d.pop('time')
for device, value in sorted(d.items()):
row = [section, SN, HOSTNAME, time, device, value]
writer.writerow(row)
# End for
# Open output for reading and show number of line we extracted
with open(currsection_output, "r") as currsection:
num_lines = sum(1 for line in currsection)
print (section + " section: Wrote", num_lines, "lines")
# End for
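A note on the "in memory" requirement mentioned above: csv.DictReader only needs a file-like object, so the TemporaryFile could presumably be replaced by a cStringIO buffer with no other change. A self-contained Python 2 sketch of just that idea (the two sample rows are made up to keep it short):

import csv
from cStringIO import StringIO

buf = StringIO()                                 # in-memory stand-in for TemporaryFile()
buf.write('time,sda,sdb\n')                      # header, as written by the parsing loop
buf.write('26-JUL-2014 11:10:44,4.4,0.0\n')      # one perf data row
buf.seek(0)                                      # rewind before transposing, as with the temp file

for d in csv.DictReader(buf):
    time = d.pop('time')
    for device, value in sorted(d.items()):
        print time, device, value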
1 Answer
The goal is to transpose the data so that we end up with a header like "time,device,value".
The simple transposition logic looks like this:
text = '''time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
'''
import csv

for d in csv.DictReader(text.splitlines()):
    time = d.pop('time')
    for device, value in sorted(d.items()):
        print time, device, value
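Run against the sample data, this prints one line per timestamp/device pair, in device order (Python 2's print with commas separates the fields with spaces rather than commas):

26-JUL-2014 11:10:44 sda 4.4
26-JUL-2014 11:10:44 sda1 0.0
26-JUL-2014 11:10:44 sda2 0.0

and so on.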
Putting it all together into a full script, it looks roughly like this:
import csv

with open('transposed.csv', 'wb') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    with open('data.csv', 'rb') as sourcefile:
        for d in csv.DictReader(sourcefile):
            time = d.pop('time')
            for device, value in sorted(d.items()):
                row = [time, device, value]
                writer.writerow(row)
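With data.csv holding the sample data from the question, transposed.csv then starts like this:

time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0

one row per timestamp/device pair, which is exactly the target layout.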