Creating a DataFrame from inconsistently named columns

Posted 2024-03-29 11:50:26


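I have a pandas.DataFrame with redundant columns, because the source files (.csv) name their columns inconsistently. As a result, most of the values in those columns are NaN: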

Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType
 NaN       NaN     W20848      NaN             Subscriber
 NaN       NaN     W20231      NaN             Subscriber
 NaN       NaN     W00785      NaN             Subscriber
 NaN       NaN     W00126      NaN             Subscriber
 NaN       NaN     W20929      NaN             Casual

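Is there a way to create a new column and fill it from whichever of these columns has a value? And if more than one column is not NaN, can I choose which column the value is taken from?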

 Bike#   Bikenumber   Bike #   Selected_Num
number1   number2      NaN       number2

When I fill from a single column, I get the result I want:

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'])
print(sample)

    Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType   Bike_Num
     NaN       NaN     W20848      NaN             Subscriber       W20848
     NaN       NaN     W20231      NaN             Subscriber       W20231
     NaN       NaN     W00785      NaN             Subscriber       W00785
     NaN       NaN     W00126      NaN             Subscriber       W00126
     NaN       NaN     W20929      NaN             Casual           W20929

But this fails:

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'], sample['Bikenumber'])

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

1 answer
#1 · posted 2024-03-29 11:50:26

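I suggest solving this when you read the CSV files, rather than trying to untangle the columns afterwards. One way is to run the files through a small parser before handing them to pandas.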

The parser takes an open file handle for the CSV, plus a dict that maps each desired column name to its possible synonyms.

Code:

import csv

def read_my_csv(file_handle, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synoms = dict(sum([
        [(syn, k) for syn in v] for k, v in column_map.items()], []))

    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and map columns to desired names
    header = next(reader)
    header = [synoms.get(c, c) for c in header]

    # yield the header
    yield header

    # yield the remaining rows
    for row in reader:
        yield row
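
The `dict(sum(...))` expression at the top of `read_my_csv` flattens the mapping into a single synonym-to-name lookup. As a rough sketch, an equivalent dict comprehension makes the resulting shape easier to see. The names `synonyms` and `header`, and the extra `Duration` column, are illustrative assumptions and not part of the original answer:

```python
column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

# Invert {desired_name: (synonym, ...)} into {synonym: desired_name}.
synonyms = {syn: name for name, syns in column_map.items() for syn in syns}

# A hypothetical header row; 'Duration' has no synonym entry, so the
# .get(c, c) default leaves it unchanged.
header = ['Bike #', 'SubscriptionType', 'Duration']
renamed = [synonyms.get(c, c) for c in header]
print(renamed)  # ['Bike_Num', 'Sub_Num', 'Duration']
```
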

Test code:

import pandas as pd
import csv

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

# the 'U' mode flag was removed in Python 3.11; the csv module expects newline=''
with open("sample.csv", newline='') as f:
    generator = read_my_csv(f, column_map)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)

sample.csv:

Bike #,SubscriptionType
W20848,Subscriber
W20231,Subscriber
W00785,Subscriber
W00126,Subscriber
W20929,Casual

Result:

  Bike_Num     Sub_Num
0   W20848  Subscriber
1   W20231  Subscriber
2   W00785  Subscriber
3   W00126  Subscriber
4   W20929      Casual

Solution #2

A cleaner, though less fun, solution is to rename the columns before doing the concat:

Code:

def fix_column_names(df, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synoms = dict(sum([
        [(syn, k) for syn in v] for k, v in column_map.items()], []))

    # rename columns
    df.columns = [synoms.get(c, c) for c in df.columns]
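
The concat step that this renaming is meant to precede is not shown in the answer. A minimal sketch of how the two might fit together, assuming two hypothetical in-memory frames (`df_a`, `df_b`) standing in for CSV files with different headers, and using a dict comprehension in place of the `sum` idiom:

```python
import pandas as pd

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

def fix_column_names(df, column_map):
    # Invert the mapping, then rename the frame's columns in place.
    synonyms = {syn: name for name, syns in column_map.items() for syn in syns}
    df.columns = [synonyms.get(c, c) for c in df.columns]

# Hypothetical stand-ins for frames read from inconsistently named files.
df_a = pd.DataFrame({'Bike #': ['W20848'], 'SubscriptionType': ['Subscriber']})
df_b = pd.DataFrame({'Bikenumber': ['W20929'], 'SubscriberType': ['Casual']})

frames = [df_a, df_b]
for frame in frames:
    fix_column_names(frame, column_map)

# With the names aligned, concat no longer creates redundant NaN columns.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Because every frame is renamed before it reaches `pd.concat`, the combined frame should have only the `Bike_Num` and `Sub_Num` columns.
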

Test code:

import pandas as pd

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

df = pd.read_csv('sample.csv', header=0)
fix_column_names(df, column_map)
print(df)
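
Both solutions avoid the redundant columns up front. If the frame has already been built with them, as in the question, one option is to chain `fillna` calls. `fillna` takes a single fill value, so a second fallback column cannot be passed as an extra positional argument. The sketch below assumes a hypothetical helper named `coalesce` and made-up sample values; the order of the column list decides which column wins when several hold a value:

```python
import numpy as np
import pandas as pd

# Made-up data: the second row has values in two of the columns.
sample = pd.DataFrame({
    'Bike #': [np.nan, 'W00001', np.nan],
    'Bikenumber': [np.nan, 'W00002', 'W00003'],
    'Bike#': ['W20848', np.nan, np.nan],
})

def coalesce(df, columns):
    """Return the first non-null value per row, scanning columns in order."""
    result = df[columns[0]]
    for col in columns[1:]:
        # Only rows that are still NaN are filled from the next column.
        result = result.fillna(df[col])
    return result

# 'Bikenumber' has the highest priority here, then 'Bike #', then 'Bike#'.
sample['Bike_Num'] = coalesce(sample, ['Bikenumber', 'Bike #', 'Bike#'])
print(sample)
```
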
