如何将改变结构的csv读入数据帧

2024-05-23 20:56:39 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图通过read_csv导入的csv给我带来了麻烦,因为中的某些行缺少值,这会导致尾随逗号,从而搞乱了如何将其读取到数据帧中

来自Kaggle: Facebook Ad Campaign的数据

CSV示例行

1121593,26/08/2017,26/08/2017,1178,144622,45-49,M,10,16,16,1177535,221,365.6600009,15,3
1121594,26/08/2017,26/08/2017,45-49,M,10,14,14,426500,72,128.2799988,4,1,,
df = pd.read_csv('test.csv', header=None)

# result
        0           1           2      3       4      5   6   7       8   9            10   11           12  13 14
0  1121593  26/08/2017  26/08/2017   1178  144622  45-49   M  10      16  16      1177535  221  365.6600009  15  3
1  1121594  26/08/2017  26/08/2017  45-49       M     10  14  14  426500  72  128.2799988    4            1       

所需

        0           1           2       3         4      5  6   7   8   9        10   11          12  13  14
0  1121593  26/08/2017  26/08/2017  1178.0  144622.0  45-49  M  10  16  16  1177535  221  365.660001  15   3
1  1121594  26/08/2017  26/08/2017     NaN       NaN  45-49  M  10  14  14   426500   72  128.279999   4   1

对如何处理这些数据有什么建议吗?理想情况下,缺失的值(第二行中45-49之前)将显示为NaN


Tags: csv数据test示例dfreadfacebooknan
2条回答

当您得到有已知故障的坏数据集时,一个好的解决方案是清除数据并将好的数据写回磁盘。该代码只需在下载后运行,而您的其他代码不会因修复程序的错误而负担过重。这最好使用csv模块来完成,我们可以逐行修复

kaggle_campaign_data_fixer.py

import sys
import csv
from pathlib import Path

filename = Path(sys.argv[1])
newname = filename.parent/f"{filename.stem}-fixed{filename.suffix}"

BADCOLS = ['', '']

with open(filename, newline='') as infile, open(newname, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        if row[-2:] == BADCOLS:
            row[3:3] = BADCOLS
            del row[-2:]
        writer.writerow(row)

# test it
import pandas as pd

df = pd.read_csv(filename, header=None)
print(df)
print("""
============== FIXED ==================
""")
df = pd.read_csv(newname, header=None)
print(df)
  • 问题是数据集
  • 从第762行(索引为0)开始,数据不再包含'campaign_id''fb_campaign_id'
  • 我建议将数据读入两个独立的数据帧
  • 将缺少的带有None值的列添加到df2
  • df2列进行排序以匹配df1的顺序
  • 合并这两个数据帧
  • 来自Facebook Ad Campaign的有问题的数据集似乎是Sales Conversion Optimization的副本,而Sales Conversion Optimization没有这个问题。
    • 如本数据集discussion所述
    • 因此,在没有附加问题的情况下使用数据集可能是最简单的
import pandas as pd

# read through row 761
df1 = pd.read_csv('data.csv', skiprows=range(762, 1145))

# create a different header for the 2nd part of the data
cols = df1.columns[:3].tolist() + df1.columns[5:].tolist()

# read from row 762 and skip the two blank end columns
df2 = pd.read_csv('data.csv', header=None, names=cols, skiprows=range(762), usecols=range(13))

# add the missing columns to df2
df2[['campaign_id', 'fb_campaign_id']] = None

# sort df2 columns by the same order as df1
df2 = df2.reindex(df1.columns, axis=1)

# combine the two dataframes
df = pd.concat([df1, df2]).reset_index(drop=True)

df-头和尾

  ad_id reporting_start reporting_end campaign_id fb_campaign_id    age gender  interest1  interest2  interest3  impressions  clicks  spent  total_conversion  approved_conversion
 708746      17/08/2017    17/08/2017         916         103916  30-34      M         15         17         17         7350       1   1.43                 2                    1
 708749      17/08/2017    17/08/2017         916         103917  30-34      M         16         19         21        17861       2   1.82                 2                    0
 708771      17/08/2017    17/08/2017         916         103920  30-34      M         20         25         22          693       0   0.00                 1                    0
 708815      30/08/2017    30/08/2017         916         103928  30-34      M         28         32         32         4259       1   1.25                 1                    0
 708818      17/08/2017    17/08/2017         916         103928  30-34      M         28         33         32         4133       1   1.29                 1                    1
 1314410      19/08/2017    19/08/2017        None           None  45-49      F        109        111        114      1129773     252  358.189997                13                    2
 1314411      19/08/2017    19/08/2017        None           None  45-49      F        110        111        116       637549     120  173.880003                 3                    0
 1314412      19/08/2017    19/08/2017        None           None  45-49      F        111        113        117       151531      28   40.289999                 2                    0
 1314414      17/08/2017    17/08/2017        None           None  45-49      F        113        114        117       790253     135  198.710000                 8                    2
 1314415      17/08/2017    17/08/2017        None           None  45-49      F        114        116        118       513161     114  165.609999                 5                    2

df1-头和尾

  ad_id reporting_start reporting_end  campaign_id  fb_campaign_id    age gender  interest1  interest2  interest3  impressions  clicks  spent  total_conversion  approved_conversion
 708746      17/08/2017    17/08/2017          916          103916  30-34      M         15         17         17         7350       1   1.43                 2                    1
 708749      17/08/2017    17/08/2017          916          103917  30-34      M         16         19         21        17861       2   1.82                 2                    0
 708771      17/08/2017    17/08/2017          916          103920  30-34      M         20         25         22          693       0   0.00                 1                    0
 708815      30/08/2017    30/08/2017          916          103928  30-34      M         28         32         32         4259       1   1.25                 1                    0
 708818      17/08/2017    17/08/2017          916          103928  30-34      M         28         33         32         4133       1   1.29                 1                    1
 1121585      20/08/2017    20/08/2017         1178          144621  40-44      M         66         72         68         9773       1    1.460000                 1                    0
 1121589      20/08/2017    20/08/2017         1178          144622  45-49      M         10         16         11       464036      77  123.550000                 3                    1
 1121590      20/08/2017    20/08/2017         1178          144622  45-49      M         10         16         15       478480      75  135.750001                 3                    1
 1121592      20/08/2017    20/08/2017         1178          144622  45-49      M         10         14         11       428812      66  116.880000                 4                    2
 1121593      26/08/2017    26/08/2017         1178          144622  45-49      M         10         16         16      1177535     221  365.660001                15                    3

df2-头部和尾部

   ad_id reporting_start reporting_end    age gender  interest1  interest2  interest3  impressions  clicks       spent  total_conversion  approved_conversion
 1121594      26/08/2017    26/08/2017  45-49      M         10         14         14       426500      72  128.279999                 4                    1
 1121597      30/08/2017    30/08/2017  45-49      M         15         21         19        54237       7   10.780000                 2                    1
 1121598      30/08/2017    30/08/2017  45-49      M         15         19         18       506916      89  133.699999                 2                    2
 1121599      30/08/2017    30/08/2017  45-49      M         15         17         18       250960      42   64.880000                 2                    0
 1121601      30/08/2017    30/08/2017  45-49      M         16         20         18      2286228     353  603.380002                16                    7
 1314410      19/08/2017    19/08/2017  45-49      F        109        111        114      1129773     252  358.189997                13                    2
 1314411      19/08/2017    19/08/2017  45-49      F        110        111        116       637549     120  173.880003                 3                    0
 1314412      19/08/2017    19/08/2017  45-49      F        111        113        117       151531      28   40.289999                 2                    0
 1314414      17/08/2017    17/08/2017  45-49      F        113        114        117       790253     135  198.710000                 8                    2
 1314415      17/08/2017    17/08/2017  45-49      F        114        116        118       513161     114  165.609999                 5                    2

相关问题 更多 >