Python csv.DictReader cannot read a CSV file from data.gov

4 votes
4 answers
5590 views
Asked 2025-04-16 16:53

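I'm working with some random CSV data from data.gov, for example "Gravesite locations of Veterans and beneficiaries in Hawaii, as of January 2011" http://www.data.gov/raw/4608. I want to parse this CSV file with Python and process each row: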

import csv

randomData = csv.DictReader(open('/downloads/ngl_hawaii.csv', 'rb'), delimiter=",")
for row in randomData:
    print row

Here is the sample CSV data:

d_first_name,d_mid_name,d_last_name,d_suffix,d_birth_date,d_death_date,section_id,row_num,site_num,cem_name,cem_addr_one,cem_addr_two,city,state,zip,cem_url,cem_phone,relationship,v_first_name,v_mid_name,v_last_name,v_suffix,branch,rank,war

Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","111444","","SXXXXX","Veteran (Self)","Joe","E","JoJo","","US ARMY","SGT","WORLD WAR II"

The result does not look right (this is one printed row):

{'v_last_name': None, 'cem_addr_two': None, 'rank': None, 'd_suffix': None, 'city': None, 'row_num': None, 'zip': None, 'cem_phone': None, 'd_last_name': None, 'd_first_name': 'Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","11144","SXXXXX","","US ARMY","SGT","WORLD WAR II"', 'war': None, 'v_mid_name': None, 'cem_url': None, 'cem_name': None, 'relationship': None, 'v_first_name': None, 'cem_addr_one': None, 'd_birth_date': None, 'd_death_date': None}

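As you can see, the header fields (the first line of the CSV) are not matched up with the values in the rows that follow: the entire row ends up under d_first_name and every other key is None.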

Am I doing something wrong, or is the CSV file malformed?

Thanks to Casey for asking whether I had opened the file in another program. Excel had mangled the file...
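The symptom above, where the whole row lands under the first key and every other key is None, can be reproduced if each data line has been wrapped in one pair of quotes with its inner quotes doubled. That is only an assumption about what Excel did to the file. The sample below is made up and shortened to three columns, and the sketch uses Python 3 rather than the Python 2 of the question.

```python
import csv
import io

# Hypothetical three-column sample. The data line is wrapped in one pair of
# quotes and its inner quotes are doubled, which is ASSUMED to be what a
# spreadsheet round trip did to the real file.
mangled = (
    'd_first_name,d_mid_name,d_last_name\n'
    '"Joe,""E"",""JoJo"""\n'
)

rows = list(csv.DictReader(io.StringIO(mangled)))
row = rows[0]

# The parser sees a single quoted field, so only the first key gets a value
# and the remaining keys fall back to restval (None by default).
print(row['d_first_name'])
print(row['d_mid_name'], row['d_last_name'])
```

If a file shows this pattern, downloading a fresh copy is probably simpler than trying to repair the quoting.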

4 Answers

1

I just tried it, and it runs fine against your file (which I renamed to foo.csv).

import csv

ifile  = open('foo.csv', "rb")
reader = csv.reader(ifile)

rownum = 0
for row in reader:
    # Save header row.
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            print '%-8s: %s' % (header[colnum], col)
            colnum += 1

    rownum += 1

ifile.close()

Output:

d_first_name: Emil
d_mid_name: E
d_last_name: Seibel
d_suffix: 
d_birth_date: 10/02/1920
d_death_date: 03/12/2010
section_id: 139-E
row_num : 
site_num: 3
cem_name: HAWAII STATE VETERANS CEMETERY
cem_addr_one: KAMEHAMEHA HIGHWAY
cem_addr_two: 
city    : KANEOHE
state   : HI
zip     : 96744
cem_url : 
cem_phone: 808-233-3630
relationship: Veteran (Self)
v_first_name: Emil
v_mid_name: E
v_last_name: Seibel
v_suffix: 
branch  : US ARMY
rank    : SGT
war     : WORLD WAR II

3

Odd, I get different output from yours.

Data file data.csv:

d_first_name,d_mid_name,d_last_name,d_suffix,d_birth_date,d_death_date,section_id,row_num,site_num,cem_name,cem_addr_one,cem_addr_two,city,state,zip,cem_url,cem_phone,relationship,v_first_name,v_mid_name,v_last_name,v_suffix,branch,rank,war
"Emil","E","Seibel","","10/02/1920","03/12/2010","139-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","96744","","808-233-3630","Veteran (Self)","Emil","E","Seibel","","US ARMY","SGT","WORLD WAR II",

Script:

import csv

for line in csv.DictReader(open('data.csv', 'rb'), delimiter=","):
    print line

Output:

{'v_last_name': 'Seibel', None: [''], 'cem_addr_two': '', 'rank': 'SGT', 'd_suffix': '', 'city': 'KANEOHE', 'row_num': '', 'zip': '96744', 'cem_phone': '808-233-3630', 'd_last_name': 'Seibel', 'd_mid_name': 'E', 'state': 'HI', 'branch': 'US ARMY', 'd_first_name': 'Emil', 'war': 'WORLD WAR II', 'v_mid_name': 'E', 'cem_url': '', 'cem_name': 'HAWAII STATE VETERANS CEMETERY', 'relationship': 'Veteran (Self)', 'v_first_name': 'Emil', 'section_id': '139-E', 'v_suffix': '', 'site_num': '3', 'cem_addr_one': 'KAMEHAMEHA HIGHWAY', 'd_birth_date': '10/02/1920', 'd_death_date': '03/12/2010'}

As the documentation states, csv.DictReader takes the field names from the first line of the file automatically when no fieldnames argument is supplied.

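The None: [''] in the output is there because each data row ends with an extra comma. That comma creates one more (empty) field than the header has names for, and DictReader stores surplus fields as a list under restkey, which defaults to None.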

Working code example:

http://codepad.org/HdBhr4La
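The trailing-comma effect described in this answer can be reproduced in a few lines. The following is a minimal sketch in Python 3, using a made-up two-column sample instead of the real file; the cleanup step at the end is just one possible way to discard the surplus entry, not something the original answer proposes.

```python
import csv
import io

# Made-up two-column sample; the data row ends with a stray comma.
data = (
    'branch,rank\n'
    '"US ARMY","SGT",\n'
)

row = next(csv.DictReader(io.StringIO(data)))

# The stray comma yields a third, empty field. DictReader collects surplus
# fields in a list stored under restkey, which defaults to None.
print(row[None])

# One possible cleanup: drop the surplus entry before using the row.
cleaned = {key: value for key, value in row.items() if key is not None}
print(cleaned)
```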

2

I downloaded the original file from here, and it is valid CSV. I misread your script's output earlier.

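Because you are using csv.DictReader, each row becomes a dictionary whose keys are the header values and whose values are that row's data. I ran it against the same file and everything appeared to line up, although I did not check every row.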

According to the Python documentation:

class csv.DictReader(csvfile[, fieldnames=None[, restkey=None[, restval=None[, dialect='excel'[, *args, **kwds]]]]])

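Create an object which operates like a regular reader but maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames. If the row read has more fields than the fieldnames sequence, the remaining data is added as a sequence keyed by the value of restkey. If the row read has fewer fields than the fieldnames sequence, the remaining keys take the value of the optional restval parameter. Any other optional or keyword arguments are passed to the underlying reader instance.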
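To make the restkey and restval behaviour concrete, here is a small sketch in Python 3. The three-column sample is made up, and the values 'overflow' and 'MISSING' are arbitrary choices for illustration only.

```python
import csv
import io

# Made-up sample: the first data row is one field short, the second has one
# field too many.
data = (
    'city,state,zip\n'
    'KANEOHE,HI\n'
    'KANEOHE,HI,96744,extra\n'
)

# 'overflow' and 'MISSING' are arbitrary illustrative values.
reader = csv.DictReader(io.StringIO(data), restkey='overflow', restval='MISSING')
short_row, long_row = list(reader)

print(short_row['zip'])       # filled in from restval
print(long_row['overflow'])   # surplus fields, collected in a list
```

Left at their defaults, both parameters are None, which is why missing values show up as None and surplus fields show up under the key None.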

If that is not the format you want, you can use csv.reader instead, which returns each row as a list without associating it with the header.

If you want to use DictReader as above, this is probably what you need:

import csv
reader = csv.DictReader(open('ngl_hawaii.csv', 'rb'), delimiter=',')
for row in reader:
    print row['d_first_name']
    print row['d_last_name']
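The snippet above is Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. The helper name print_names is hypothetical, and the UTF-8 encoding is an assumption about the file rather than something stated in the thread.

```python
import csv


def print_names(path):
    """Print first and last name for each row and return the rows read."""
    rows = []
    # Python 3 wants text mode with newline='' so the csv module handles
    # line endings itself. The UTF-8 encoding is an assumption.
    with open(path, newline='', encoding='utf-8') as handle:
        for row in csv.DictReader(handle):
            print(row['d_first_name'], row['d_last_name'])
            rows.append(row)
    return rows
```

Calling print_names('ngl_hawaii.csv') would print one name pair per row, provided the file has those two columns.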
