Selectively transposing and transforming an unstructured CSV into Python dataframes



I have a headless & unstructured CSV file in which the first two records follow a fixed layout, but from the third row onwards the number of columns varies:

100,NEM12,202006011242,TCAUSTM,PROGREEN 
200,VCCCCB0100,E1K1Q1,001,E1,N1,006668,KWH,15,
300,20200406,122.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,120.000,120.000,122.000,120.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,F16,28,,20200601113727,
300,20200407,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,4.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,4.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,F16,28,,20200601113727,
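
For orientation, here is a minimal sketch (assuming the file has been saved as sample.csv, a name used only for this example) of separating the three record types by their leading type code before any reshaping; the varying column count of the 300 rows is what makes a straight pandas read of the whole file awkward:

import csv

# group the jagged rows by their leading record-type code (100 / 200 / 300)
records = {"100": [], "200": [], "300": []}
with open("sample.csv", newline="") as f:
    for row in csv.reader(f):
        if row and row[0] in records:
            records[row[0]].append(row[1:])  # drop the type code itself

print(len(records["300"]), "interval (300) records found")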

I have tried converting the rows to columns as suggested here, which produces output like the following:

[screenshot: all rows transposed into columns]

Ideally, though, I would handle the first two records as two column dataframes, and transpose the 300 records as above, except that I also want to append VCCCCB0100, E1, 006668 & KWH from the 200 record's dataframe, so that the transposed dataframe for each 300 record looks like the one below. In other words, the 300 dataframe borrows quite a few columns from the 200 record.

[screenshot: desired transposed dataframe for each 300 record, with columns borrowed from the 200 record]
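
To make that target shape concrete, here is a rough sketch (not the approach in the answer below, and using illustrative column names of my own) of attaching the borrowed 200 fields to one transposed 300 record; the field positions come from the sample rows above, and the trailing quality flags of the 300 record are left out for brevity:

import pandas as pd

# fixed-layout 200 record and a shortened 300 record from the sample above
row200 = "200,VCCCCB0100,E1K1Q1,001,E1,N1,006668,KWH,15,".split(",")
row300 = "300,20200406,122.000,120.000,120.000".split(",")

readings = pd.DataFrame({
    "Interval_No": range(1, len(row300[2:]) + 1),
    "Reading": [float(v) for v in row300[2:]],
})
readings["Interval_date"] = row300[1]
# constant columns borrowed from the 200 record
readings["NMI"] = row200[1]           # VCCCCB0100
readings["NMI_Suffix"] = row200[4]    # E1
readings["Meter_Serial"] = row200[6]  # 006668
readings["UOM"] = row200[7]           # KWH

print(readings)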


Tags: file, csv, data, records, layout, structured, nem12
1 Answer

I think you have to build up the structure you want systematically. I can't see any way that from_csv() could work here.

  1. Load the CSV data straight from the file and build a 2D list with a list comprehension
  2. concat() the columns from the first & second rows, the date from the remaining rows, plus an array of readings and an array of interval numbers
  3. pd.Series.explode the arrays from step 2 (explode multiple lists)
  4. Finally clean up: name the columns and drop the ones that are not needed
import pandas as pd

data = '''100,NEM12,202006011242,TCAUSTM,PROGREEN 
200,VCCCCB0100,E1K1Q1,001,E1,N1,006668,KWH,15,
300,20200406,122.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,120.000,120.000,122.000,120.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,F16,28,,20200601113727,
300,20200407,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,4.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,4.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,F16,28,,20200601113727,'''

with open("sample.csv", "w") as f:
    f.write(data)

with open("sample.csv") as f:
    data = f.read()

arr = [l.split(",") for l in data.split("\n")]    
df = pd.concat([
    pd.DataFrame(arr[0][1:]).T, # pick columns from first row - exclude 100
    pd.DataFrame(arr[1][1:]).T, # pick columns from second row - exclude 200
    # pick the interval_date (exclude 300) and then turn all other columns into an array
    pd.DataFrame(arr[2:]).iloc[:,1:2]\
        .assign(Reading=pd.DataFrame(arr[2:]).iloc[:,2:].values.tolist(),
               Int_No=[[i+1 for i,e in enumerate(l)] for l in pd.DataFrame(arr[2:]).iloc[:,2:].values.tolist()])
    ], axis=1).ffill()  # the row-1/row-2 frames have a single row; forward-fill their values onto every 300 row
df.columns = [i for i,c in enumerate(df.columns)] # clean up the column numbers
# explode the reading, then name columns and drop redundant ones
# https://stackoverflow.com/questions/45846765/efficient-way-to-unnest-explode-multiple-list-columns-in-a-pandas-dataframe
df = df.set_index(list(df.columns[:-2])).apply(pd.Series.explode).reset_index()\
    .rename({14:"Reading",13:"Interval_date",11:"Interval_length",4:"NEMI",7:"NMI_Suffix",
            10:"UOM",15:"Interval_No"}, axis=1)\
    .drop([0,1,2,3,5,6,8,9,12], axis=1)

print(df[:10].to_string(index=False))

Output

       NEMI NMI_Suffix  UOM Interval_length Interval_date  Reading Interval_No
 VCCCCB0100         E1  KWH              15      20200406  122.000           1
 VCCCCB0100         E1  KWH              15      20200406  120.000           2
 VCCCCB0100         E1  KWH              15      20200406  120.000           3
 VCCCCB0100         E1  KWH              15      20200406  122.000           4
 VCCCCB0100         E1  KWH              15      20200406  120.000           5
 VCCCCB0100         E1  KWH              15      20200406  120.000           6
 VCCCCB0100         E1  KWH              15      20200406  120.000           7
 VCCCCB0100         E1  KWH              15      20200406  122.000           8
 VCCCCB0100         E1  KWH              15      20200406  120.000           9
 VCCCCB0100         E1  KWH              15      20200406  120.000          10
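
As a side note, pandas 1.3 and later can explode several equally-long list columns in a single DataFrame.explode call, which could replace the set_index / apply(pd.Series.explode) step above; a small self-contained sketch:

import pandas as pd

# pandas >= 1.3: explode multiple list-valued columns at once,
# provided the lists in each row have matching lengths
wide = pd.DataFrame({
    "Interval_date": ["20200406"],
    "Reading": [[122.0, 120.0, 120.0]],
    "Int_No": [[1, 2, 3]],
})
tidy = wide.explode(["Reading", "Int_No"], ignore_index=True)
print(tidy)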
