Python - Transposing columns to rows during data manipulation, before writing to file

-1 vote
1 answer
3047 views
Asked 2025-04-18 15:06

I develop a publicly available open source application for Splunk (Nmon Performance Monitor for Unix and Linux systems, see https://apps.splunk.com/app/1753/).

An important part of this application is an old Perl script (recycled, modified and updated) that is automatically launched by the application to convert the Nmon data (a kind of custom csv format): it reads from standard input and writes the data, section by section, into formatted csv files (each section corresponds to one performance monitor).

I now want to rewrite this script entirely in Python. A first beta version is almost done... but I am struggling with one data-transformation step, and I'm afraid I won't be able to solve it on my own.

So today I kindly ask for help.

Here is the difficulty in detail:

Nmon generates performance data for a number of sections (cpu, memory, disks and so on). For most of them, extracting the right timestamp and the data is not too hard. But every section that involves the notion of a "device" (for example DISKBUSY in the sample provided, which represents the percentage of time each disk was busy) has to be transformed and transposed before it can be used.

Currently, the data I am able to generate looks like this:

Example:

time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0

The goal is to transpose the data so that the header becomes "time,device,value", for example:

time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0

And so on.
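For illustration, the wide-to-long transposition being asked for can be sketched with plain `zip` (the `wide` list below is a truncated, hypothetical copy of the sample data, three devices only):

```python
# Minimal sketch of the wide-to-long transposition described above.
wide = [
    ["time", "sda", "sda1", "sda2"],
    ["26-JUL-2014 11:10:44", "4.4", "0.0", "0.0"],
    ["26-JUL-2014 11:10:54", "4.8", "0.0", "0.0"],
]

header, rows = wide[0], wide[1:]
long_rows = [["time", "device", "value"]]
for row in rows:
    timestamp = row[0]
    # Pair each device name from the header with its value on this row
    for device, value in zip(header[1:], row[1:]):
        long_rows.append([timestamp, device, value])

for r in long_rows:
    print(",".join(r))
```

Each input row of N device columns becomes N output rows sharing the same timestamp.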

A month ago I asked a question about an almost identical need (for another application; the data is not exactly the same, but it likewise required transposing columns into rows):

Python - CSV time oriented transposing large number of columns to rows

I got an excellent answer that solved my problem perfectly, but unfortunately I have not been able to reuse that code in this new context. The difference is that I want to do the transposition inside the code itself, so that the script works entirely in memory and avoids managing multiple temporary files.

Here is the piece of code in question:

Note: Python 2.x is required.

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################


dynamic_section = ["DISKBUSY"]

for section in dynamic_section:

    # Set output file
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open output for writing
    with open(currsection_output, "w") as currsection:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match( myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line=re.sub("%",'_PCT',line)
                line=re.sub(" ",'_',line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search( myregex, line)            

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)    

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        currsection.write('time' + ',' + header + '\n')


                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'                

                # For Nmon V10 and more             

                timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)            
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less                  

                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)                    
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match( myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n')

        # End for

    # Open output for reading and show number of line we extracted
    with open(currsection_output, "r") as currsection:

        num_lines = sum(1 for line in currsection)
        print(section + " section: Wrote " + str(num_lines) + " lines")

# End for

This line:

                currsection.write('time' + ',' + header + '\n')

will contain the header,

while this line:

            currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n')

contains the data, line by line.

Note: the final data (header and body) should also contain other information; for simplicity I removed it from the code above.

For static sections, which do not require any transposition, the code would be:

                    currsection.write('type' + ',' + 'serialnum' + ',' + 'hostname' + ',' + 'time' + ',' + header + '\n')

and:

                currsection.write(section + ',' + SN + ',' + HOSTNAME + ',' + ZZZZ_timestamp + ',' + perfdata + '\n')

The final goal is to transpose the data as soon as it has been extracted, just before writing it out.

Also, performance and minimal system resource usage are a requirement (e.g. working in memory rather than with temporary files), to avoid generating excessive CPU load when the script runs periodically.
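One way to satisfy the in-memory requirement is to replace the temporary file with a string buffer. A minimal sketch, assuming Python 3's `io.StringIO` (Python 2 would use `cStringIO.StringIO`); the intermediate lines written here stand in for what the regex extraction stage produces:

```python
import csv
import io

# Accumulate the intermediate wide-format csv in memory instead of a temp file.
intermediate = io.StringIO()
intermediate.write("time,sda,sdc\n")
intermediate.write("26-JUL-2014 11:10:44,4.4,10.2\n")
intermediate.write("26-JUL-2014 11:10:54,4.8,5.4\n")

intermediate.seek(0)  # rewind, exactly as the temp-file version would

# Transpose wide rows into time,device,value rows, also in memory.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["time", "device", "value"])
for d in csv.DictReader(intermediate):
    timestamp = d.pop("time")
    for device, value in sorted(d.items()):
        writer.writerow([timestamp, device, value])

print(out.getvalue())
```

The buffer can then be written to the final output file in one pass, so nothing ever touches the disk twice.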

Could anyone help me achieve this? I have searched again and again, and I'm quite sure there are several ways to do it (zip, map, dictionaries, lists, split...), but I haven't managed to get there...

Please be indulgent, this is my first real Python script :-)

Any help is greatly appreciated!

More details:

  • Test nmon file

A small test nmon file is available here: http://pastebin.com/xHLRbBU0

  • Current full script

The current full script is available here: http://pastebin.com/QEnXj6Yh

To test the script, you need to:

  • Export the SPLUNK_HOME variable to whatever suits you, e.g.:

    mkdir /tmp/nmon2csv

--> Put the script and the nmon file there, and make the script executable.

export SPLUNK_HOME=/tmp/nmon2csv
mkdir -p etc/apps/nmon

And finally:

cat test.nmon | ./nmon2csv.py

The data will be generated under /tmp/nmon2csv/etc/apps/nmon/var/*

Update: working code using the csv module:

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################

dynamic_section = ["DISKBUSY","DISKBSIZE","DISKREAD","DISKWRITE","DISKXFER","DISKRIO","DISKWRIO","IOADAPT","NETERROR","NET","NETPACKET","JFSFILE","JFSINODE"]

for section in dynamic_section:

    # Set output file (will be opened after the transpose)
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open Temp
    with TemporaryFile() as tempf:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match( myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line=re.sub("%",'_PCT',line)
                line=re.sub(" ",'_',line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search( myregex, line)            

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)    

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        tempf.write('time' + ',' + header + '\n')

                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'                

                # For Nmon V10 and more             

                timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)            
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less                  

                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)                    
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match( myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    tempf.write(ZZZZ_timestamp + ',' + perfdata + '\n')


        # Open final for writing
        with open(currsection_output, "w") as currsection:

            # Rewind temp
            tempf.seek(0)

            writer = csv.writer(currsection)
            writer.writerow(['type', 'serialnum', 'hostname', 'time', 'device', 'value'])           

            for d in csv.DictReader(tempf):
                time = d.pop('time')
                for device, value in sorted(d.items()):
                    row = [section, SN, HOSTNAME, time, device, value]
                    writer.writerow(row)            

            # End for

    # Open output for reading and show number of line we extracted
    with open(currsection_output, "r") as currsection:

        num_lines = sum(1 for line in currsection)
        print(section + " section: Wrote " + str(num_lines) + " lines")

# End for

1 Answer

2

The goal is to transpose the data so that we end up with a header like "time,device,value".

The plain transposition logic looks roughly like this:

text = '''time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
'''

import csv

for d in csv.DictReader(text.splitlines()):
    time = d.pop('time')
    for device, value in sorted(d.items()):
        print time, device, value

Putting it all together into a full script, it would look something like this:

import csv

with open('transposed.csv', 'wb') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    with open('data.csv', 'rb') as sourcefile:
        for d in csv.DictReader(sourcefile):
            time = d.pop('time')
            for device, value in sorted(d.items()):
                row = [time, device, value]
                writer.writerow(row)
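As a side note (not part of the original answer, which targets Python 2): under Python 3 the csv module expects text-mode files opened with `newline=''` instead of the `'rb'`/`'wb'` binary modes above. A self-contained sketch, writing a tiny hypothetical `data.csv` first:

```python
import csv

# Write a tiny sample input so the sketch is self-contained.
with open('data.csv', 'w') as f:
    f.write('time,sda,sda1\n')
    f.write('26-JUL-2014 11:10:44,4.4,0.0\n')

# Python 3 adaptation: text mode with newline='' replaces 'rb'/'wb'.
with open('transposed.csv', 'w', newline='') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    with open('data.csv', newline='') as sourcefile:
        for d in csv.DictReader(sourcefile):
            time = d.pop('time')
            for device, value in sorted(d.items()):
                writer.writerow([time, device, value])
```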
