如何从.log fi在.csv文件中创建行和列

2024-04-29 04:42:13 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在尝试使用Python将MTurk-in中的.log文件解析为包含行和列的.csv文件。我的数据看起来像:

P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;P:,17115,GREEN,100,9;R:,17548,7,y,NaN,Correct;P:,18552,#E5DC22,100,9;R:,18972,7,f,NaN,Correct;P:,19979,GREEN,800,9;R:,20379,7,y,NaN,Correct;P:,21387,#E5DC22,800,9;R:,21733,7,f,NaN,Correct;P:,22740,RED,100,9;R:,23139,7,y,NaN,False;P:,24147,BLUE,100,9;R:,24547,7,f,NaN,False;P:,25555,RED,800,9;R:,26043,7,b,NaN,Correct;P:,27051,BLUE,800,9;

目前,我有这个,它把所有的东西都放在列中:

import pandas as pd
from pandas import read_table
log_file = '3BF51CHDTWYBE3LE8DZRA0R5AFGH0H.log'
df = read_table(log_file, sep=';|,', header=None, engine='python')

像这样:

P|14142|GREEN|800|9|R|14597|7|y|NaN|Correct|P|15605|#E5DC22|800|9|R|16108

但是,我似乎无法将其分成多行,因此它看起来更像这样:

P|14142|GREEN|800|9|R|14597|7|y|NaN|Correct|
|P|15605|#E5DC22|800|9|R|16108

也就是说,所有的“p”都在一列中,所有的颜色都在另一列中,“r”等。。你知道吗


Tags: 文件importlogfalsepandasreadtablegreen
2条回答

另一个更快的解决方案:

import pandas as pd
import numpy as np
import io

temp=u"""P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;P:,17115,GREEN,100,9;R:,17548,7,y,NaN,Correct;P:,18552,#E5DC22,100,9;R:,18972,7,f,NaN,Correct;P:,19979,GREEN,800,9;R:,20379,7,y,NaN,Correct;P:,21387,#E5DC22,800,9;R:,21733,7,f,NaN,Correct;P:,22740,RED,100,9;R:,23139,7,y,NaN,False;P:,24147,BLUE,100,9;R:,24547,7,f,NaN,False;P:,25555,RED,800,9;R:,26043,7,b,NaN,Correct;P:,27051,BLUE,800,9;"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=':', header=None, lineterminator=';')

print (df)
    0                       1
0   P      ,14142,GREEN,800,9
1   R  ,14597,7,y,NaN,Correct
2   P    ,15605,#E5DC22,800,9
3   R  ,16108,7,f,NaN,Correct
4   P      ,17115,GREEN,100,9
5   R  ,17548,7,y,NaN,Correct
6   P    ,18552,#E5DC22,100,9
7   R  ,18972,7,f,NaN,Correct
8   P      ,19979,GREEN,800,9
9   R  ,20379,7,y,NaN,Correct
10  P    ,21387,#E5DC22,800,9
11  R  ,21733,7,f,NaN,Correct
12  P        ,22740,RED,100,9
13  R    ,23139,7,y,NaN,False
14  P       ,24147,BLUE,100,9
15  R    ,24547,7,f,NaN,False
16  P        ,25555,RED,800,9
17  R  ,26043,7,b,NaN,Correct
18  P       ,27051,BLUE,800,9

首先从第一列^{}索引,然后通过^{}删除triling ,,并通过^{}创建DataFrame。最后需要将0添加到列名和^{}

df1 = df.set_index(0)[1].str.strip(',').str.split(',', expand=True)
df1.columns = df1.columns + 1
df1.reset_index(inplace=True)
print (df1)
    0      1        2    3    4        5
0   P  14142    GREEN  800    9     None
1   R  14597        7    y  NaN  Correct
2   P  15605  #E5DC22  800    9     None
3   R  16108        7    f  NaN  Correct
4   P  17115    GREEN  100    9     None
5   R  17548        7    y  NaN  Correct
6   P  18552  #E5DC22  100    9     None
7   R  18972        7    f  NaN  Correct
8   P  19979    GREEN  800    9     None
9   R  20379        7    y  NaN  Correct
10  P  21387  #E5DC22  800    9     None
11  R  21733        7    f  NaN  Correct
12  P  22740      RED  100    9     None
13  R  23139        7    y  NaN    False
14  P  24147     BLUE  100    9     None
15  R  24547        7    f  NaN    False
16  P  25555      RED  800    9     None
17  R  26043        7    b  NaN  Correct
18  P  27051     BLUE  800    9     None

时间安排:

def jez(df):
    df1 = df.set_index(0)[1].str.strip(',').str.split(',', expand=True)
    df1.columns = df1.columns + 1
    df1.reset_index(inplace=True)
    return (df1)

print (jez(df))

In [310]: %timeit (pd.concat([df[[0]], df[1].str.split(',').apply(pd.Series).iloc[:, 1: 6]], axis=1))
100 loops, best of 3: 4.85 ms per loop

In [311]: %timeit (jez(df))
1000 loops, best of 3: 1.61 ms per loop

你可以用

In [16]: df = pd.read_csv('log.txt', lineterminator=';', sep=':', header=None)

读取文件(例如,'log.txt'),假设行以';'终止,行中的分隔符是':'。你知道吗

不幸的是,您的第二列现在将包含逗号,您希望在逻辑上分隔逗号。您可以沿行拆分逗号,并将结果连接到第一列:

In [17]: pd.concat([df[[0]], df[1].str.split(',').apply(pd.Series).iloc[:, 1: 6]], axis=1)
Out[17]: 
       0      1        2    3    4        5
0      P  14142    GREEN  800    9      NaN
1      R  14597        7    y  NaN  Correct
2      P  15605  #E5DC22  800    9      NaN
3      R  16108        7    f  NaN  Correct
4      P  17115    GREEN  100    9      NaN
5      R  17548        7    y  NaN  Correct
6      P  18552  #E5DC22  100    9      NaN
7      R  18972        7    f  NaN  Correct
8      P  19979    GREEN  800    9      NaN
9      R  20379        7    y  NaN  Correct
10     P  21387  #E5DC22  800    9      NaN
11     R  21733        7    f  NaN  Correct
12     P  22740      RED  100    9      NaN
13     R  23139        7    y  NaN    False
14     P  24147     BLUE  100    9      NaN
15     R  24547        7    f  NaN    False
16     P  25555      RED  800    9      NaN
17     R  26043        7    b  NaN  Correct
18     P  27051     BLUE  800    9      NaN
19  \n\n    NaN      NaN  NaN  NaN      NaN

相关问题 更多 >