如何将改变结构的csv读入数据帧

2条回答

网友

1楼 · 编辑于 2024-05-23 20:56:39

当您得到有已知故障的坏数据集时，一个好的解决方案是清除数据并将好的数据写回磁盘。该代码只需在下载后运行，而您的其他代码不会因修复程序的错误而负担过重。这最好使用csv模块来完成，我们可以逐行修复

kaggle_campaign_data_fixer.py

import sys
import csv
from pathlib import Path

filename = Path(sys.argv[1])
newname = filename.parent/f"{filename.stem}-fixed{filename.suffix}"

BADCOLS = ['', '']

with open(filename, newline='') as infile, open(newname, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        if row[-2:] == BADCOLS:
            row[3:3] = BADCOLS
            del row[-2:]
        writer.writerow(row)

# test it
import pandas as pd

df = pd.read_csv(filename, header=None)
print(df)
print("""
============== FIXED ==================
""")
df = pd.read_csv(newname, header=None)
print(df)

网友

2楼 · 编辑于 2024-05-23 20:56:39

问题是数据集
从第762行（索引为0）开始，数据不再包含'campaign_id'和'fb_campaign_id'
我建议将数据读入两个独立的数据帧
将缺少的带有None值的列添加到df2
对df2列进行排序以匹配df1的顺序
合并这两个数据帧
来自Facebook Ad Campaign的有问题的数据集似乎是Sales Conversion Optimization的副本，而Sales Conversion Optimization没有这个问题。
- 如本数据集discussion所述
- 因此，在没有附加问题的情况下使用数据集可能是最简单的

import pandas as pd

# read through row 761
df1 = pd.read_csv('data.csv', skiprows=range(762, 1145))

# create a different header for the 2nd part of the data
cols = df1.columns[:3].tolist() + df1.columns[5:].tolist()

# read from row 762 and skip the two blank end columns
df2 = pd.read_csv('data.csv', header=None, names=cols, skiprows=range(762), usecols=range(13))

# add the missing columns to df2
df2[['campaign_id', 'fb_campaign_id']] = None

# sort df2 columns by the same order as df1
df2 = df2.reindex(df1.columns, axis=1)

# combine the two dataframes
df = pd.concat([df1, df2]).reset_index(drop=True)

`df`-头和尾

  ad_id reporting_start reporting_end campaign_id fb_campaign_id    age gender  interest1  interest2  interest3  impressions  clicks  spent  total_conversion  approved_conversion
 708746      17/08/2017    17/08/2017         916         103916  30-34      M         15         17         17         7350       1   1.43                 2                    1
 708749      17/08/2017    17/08/2017         916         103917  30-34      M         16         19         21        17861       2   1.82                 2                    0
 708771      17/08/2017    17/08/2017         916         103920  30-34      M         20         25         22          693       0   0.00                 1                    0
 708815      30/08/2017    30/08/2017         916         103928  30-34      M         28         32         32         4259       1   1.25                 1                    0
 708818      17/08/2017    17/08/2017         916         103928  30-34      M         28         33         32         4133       1   1.29                 1                    1
 1314410      19/08/2017    19/08/2017        None           None  45-49      F        109        111        114      1129773     252  358.189997                13                    2
 1314411      19/08/2017    19/08/2017        None           None  45-49      F        110        111        116       637549     120  173.880003                 3                    0
 1314412      19/08/2017    19/08/2017        None           None  45-49      F        111        113        117       151531      28   40.289999                 2                    0
 1314414      17/08/2017    17/08/2017        None           None  45-49      F        113        114        117       790253     135  198.710000                 8                    2
 1314415      17/08/2017    17/08/2017        None           None  45-49      F        114        116        118       513161     114  165.609999                 5                    2

`df1`-头和尾

  ad_id reporting_start reporting_end  campaign_id  fb_campaign_id    age gender  interest1  interest2  interest3  impressions  clicks  spent  total_conversion  approved_conversion
 708746      17/08/2017    17/08/2017          916          103916  30-34      M         15         17         17         7350       1   1.43                 2                    1
 708749      17/08/2017    17/08/2017          916          103917  30-34      M         16         19         21        17861       2   1.82                 2                    0
 708771      17/08/2017    17/08/2017          916          103920  30-34      M         20         25         22          693       0   0.00                 1                    0
 708815      30/08/2017    30/08/2017          916          103928  30-34      M         28         32         32         4259       1   1.25                 1                    0
 708818      17/08/2017    17/08/2017          916          103928  30-34      M         28         33         32         4133       1   1.29                 1                    1
 1121585      20/08/2017    20/08/2017         1178          144621  40-44      M         66         72         68         9773       1    1.460000                 1                    0
 1121589      20/08/2017    20/08/2017         1178          144622  45-49      M         10         16         11       464036      77  123.550000                 3                    1
 1121590      20/08/2017    20/08/2017         1178          144622  45-49      M         10         16         15       478480      75  135.750001                 3                    1
 1121592      20/08/2017    20/08/2017         1178          144622  45-49      M         10         14         11       428812      66  116.880000                 4                    2
 1121593      26/08/2017    26/08/2017         1178          144622  45-49      M         10         16         16      1177535     221  365.660001                15                    3

`df2`-头部和尾部

   ad_id reporting_start reporting_end    age gender  interest1  interest2  interest3  impressions  clicks       spent  total_conversion  approved_conversion
 1121594      26/08/2017    26/08/2017  45-49      M         10         14         14       426500      72  128.279999                 4                    1
 1121597      30/08/2017    30/08/2017  45-49      M         15         21         19        54237       7   10.780000                 2                    1
 1121598      30/08/2017    30/08/2017  45-49      M         15         19         18       506916      89  133.699999                 2                    2
 1121599      30/08/2017    30/08/2017  45-49      M         15         17         18       250960      42   64.880000                 2                    0
 1121601      30/08/2017    30/08/2017  45-49      M         16         20         18      2286228     353  603.380002                16                    7
 1314410      19/08/2017    19/08/2017  45-49      F        109        111        114      1129773     252  358.189997                13                    2
 1314411      19/08/2017    19/08/2017  45-49      F        110        111        116       637549     120  173.880003                 3                    0
 1314412      19/08/2017    19/08/2017  45-49      F        111        113        117       151531      28   40.289999                 2                    0
 1314414      17/08/2017    17/08/2017  45-49      F        113        114        117       790253     135  198.710000                 8                    2
 1314415      17/08/2017    17/08/2017  45-49      F        114        116        118       513161     114  165.609999                 5                    2

CSV示例行

所需

`df`-头和尾

`df1`-头和尾

`df2`-头部和尾部

相关问题更多 >

编程相关推荐

热门问题

热门文章

如何将改变结构的csv读入数据帧

CSV示例行

所需

df-头和尾

df1-头和尾

df2-头部和尾部

相关问题 更多 >

编程相关推荐

热门问题

热门文章

`df`-头和尾

`df1`-头和尾

`df2`-头部和尾部

相关问题更多 >