Converting JSON data to a pandas DataFrame

Posted 2024-04-20 10:21:25


I am using a Python package (imported below as censusgeocode) to geocode addresses so that the results can be merged with other geo IDs.

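I have a CSV file containing all of my street addresses. The following code works fine: it loads the file, brings in the data, and runs the geocode function on each row: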

#For geocoding: 
import censusgeocode as cg

#For data handling: 
import pandas as pd

addresses = pd.read_csv('addresslist.csv') 
geo_set = []
#just test it on the first two addresses (iloc[0:2] selects rows 0 and 1)
for index, row in addresses.iloc[0:2].iterrows():
     try:
         nextline = cg.address(str(row['residential_address']), city=str(row['mailing_city']), state=str(row['mailing_state']), zipcode=str(row['mailing_zip_code']))
         nextline
         geo_set.append(nextline)
     except:
         pass
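
The bare `except: pass` above silently drops any address that fails to geocode. Here is a minimal sketch of one way to keep track of failures instead. It assumes a hypothetical `geocode_fn` callable standing in for the real geocoder call (it is not part of censusgeocode) and uses plain dicts in place of DataFrame rows:

```python
def geocode_rows(rows, geocode_fn):
    """Geocode each row; return (results, failures).

    geocode_fn is a hypothetical stand-in for the real geocoder call.
    """
    results, failures = [], []
    for index, row in enumerate(rows):
        try:
            results.append(geocode_fn(row))
        except Exception as exc:  # prefer narrower exception types if known
            failures.append((index, repr(exc)))
    return results, failures


def fake_geocoder(row):
    # Hypothetical geocoder used only for this demonstration.
    if not row.get("residential_address"):
        raise ValueError("missing address")
    return [{"matchedAddress": row["residential_address"].upper()}]


rows = [
    {"residential_address": "1 e reverend ave"},
    {"residential_address": ""},
]
results, failures = geocode_rows(rows, fake_geocoder)
print(results)
print(failures)
```

Keeping the row index alongside the error makes it possible to see afterwards which addresses need another look.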

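That is the background; everything above works fine. What I am struggling with is converting the resulting output into a pandas DataFrame. Here is my code: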

[code not preserved in the source]

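I have tried changing a million different things and keep getting error messages. Can anyone tell me what is wrong with the code? I am fairly sure it has to do with how I understand the nested structure. The error I get is: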

TypeError: list indices must be integers or slices, not str

Here is the data I am trying to turn into a DataFrame:

[[{'addressComponents': {'city': 'BOULDER',
    'fromAddress': '1',
    'preDirection': 'E',
    'preQualifier': '',
    'preType': '',
    'state': 'CO',
    'streetName': 'REVEREND',
    'suffixDirection': '',
    'suffixQualifier': '',
    'suffixType': 'AVE',
    'toAddress': '99',
    'zip': '80211'},
   'coordinates': {'x': -135.98743, 'y': 43.714783},
   'geographies': {'2010 Census Blocks': [{'AREALAND': 21481,
      'AREAWATER': 0,
      'BASENAME': '4003',
      'BLKGRP': '4',
      'BLOCK': '4003',
      'CENTLAT': '+43.7156677',
      'CENTLON': '-135.9868842',
      'COUNTY': '031',
      'FUNCSTAT': 'S',
      'GEOID': '080300028024003',
      'INTPTLAT': '+43.7156677',
      'INTPTLON': '-135.9868842',
      'LSADC': 'BK',
      'LWBLKTYP': 'L',
      'MTFCC': 'G5040',
      'NAME': 'Block 4113',
      'OBJECTID': 6626210,
      'OID': 210403980440495,
      'STATE': '08',
      'SUFFIX': '',
      'TRACT': '002802'}],
    'Census Tracts': [{'status': 'Layer query encountered an error: java.lang.RuntimeException: Failed to return'}],
    'Counties': [{'AREALAND': 397083755,
      'AREAWATER': 4237705,
      'BASENAME': 'Boulder',
      'CENTLAT': '+43.7621497',
      'CENTLON': '-135.8760655',
      'COUNTY': '033',
      'COUNTYCC': 'H6',
      'COUNTYNS': '00198131',
      'FUNCSTAT': 'C',
      'GEOID': '08033',
      'INTPTLAT': '+43.7618502',
      'INTPTLON': '-135.8811054',
      'LSADC': '06',
      'MTFCC': 'G4020',
      'NAME': 'Boulder County',
      'OBJECTID': 625,
      'OID': 27590700234321,
      'STATE': '08'}],
    'States': [{'AREALAND': 268426005696,
      'AREAWATER': 1178507593,
      'BASENAME': 'Colorado',
      'CENTLAT': '+38.9976179',
      'CENTLON': '-105.5478280',
      'DIVISION': '8',
      'FUNCSTAT': 'A',
      'GEOID': '08',
      'INTPTLAT': '+38.9938482',
      'INTPTLON': '-105.5083165',
      'LSADC': '00',
      'MTFCC': 'G4000',
      'NAME': 'Colorado',
      'OBJECTID': 27,
      'OID': 2749086215995,
      'REGION': '4',
      'STATE': '08',
      'STATENS': '01779779',
      'STUSAB': 'CO'}]},
   'matchedAddress': '1 E BAYAUD AVE, DENVER, CO, 80209',
   'tigerLine': {'side': 'L', 'tigerLineId': '177330882'}}],
 [{'addressComponents': {'city': 'DENVER',
    'fromAddress': '1',
    'preDirection': 'E',
    'preQualifier': '',
    'preType': '',
    'state': 'CO',
    'streetName': 'REVEREND',
    'suffixDirection': '',
    'suffixQualifier': '',
    'suffixType': 'AVE',
    'toAddress': '99',
    'zip': '80209'},
   'coordinates': {'x': -135.98743, 'y': 43.714783},
   'geographies': {'2010 Census Blocks': [{'AREALAND': 21481,
      'AREAWATER': 0,
      'BASENAME': '4003',
      'BLKGRP': '4',
      'BLOCK': '4003',
      'CENTLAT': '+43.7156677',
      'CENTLON': '-135.9868842',
      'COUNTY': '033',
      'FUNCSTAT': 'S',
      'GEOID': '080330028024113',
      'INTPTLAT': '+43.7156677',
      'INTPTLON': '-135.9868842',
      'LSADC': 'BK',
      'LWBLKTYP': 'L',
      'MTFCC': 'G5041',
      'NAME': 'Block 4233',
      'OBJECTID': 6626210,
      'OID': 210403980440495,
      'STATE': '08',
      'SUFFIX': '',
      'TRACT': '002802'}],
    'Census Tracts': [{'AREALAND': 886991,
      'AREAWATER': 0,
      'BASENAME': '32.02',
      'CENTLAT': '+43.7177365',
      'CENTLON': '-135.9841763',
      'COUNTY': '031',
      'FUNCSTAT': 'S',
      'GEOID': '08033002802',
      'INTPTLAT': '+43.7177365',
      'INTPTLON': '-135.9841763',
      'LSADC': 'CT',
      'MTFCC': 'G5020',
      'NAME': 'Census Tract 41.02',
      'OBJECTID': 65498,
      'OID': 20790703831619,
      'STATE': '08',
      'TRACT': '002802'}],
    'Counties': [{'AREALAND': 397083755,
      'AREAWATER': 4237705,
      'BASENAME': 'Boulder',
      'CENTLAT': '+43.7621497',
      'CENTLON': '-135.8760655',
      'COUNTY': '033',
      'COUNTYCC': 'H6',
      'COUNTYNS': '00198133',
      'FUNCSTAT': 'C',
      'GEOID': '08033',
      'INTPTLAT': '+43.7618502',
      'INTPTLON': '-135.8811054',
      'LSADC': '06',
      'MTFCC': 'G4020',
      'NAME': 'Boulder County',
      'OBJECTID': 625,
      'OID': 27590700234321,
      'STATE': '08'}],
    'States': [{'AREALAND': 268426005696,
      'AREAWATER': 1178507593,
      'BASENAME': 'Colorado',
      'CENTLAT': '+43.9976179',
      'CENTLON': '-135.5478280',
      'DIVISION': '8',
      'FUNCSTAT': 'A',
      'GEOID': '08',
      'INTPTLAT': '+43.9938482',
      'INTPTLON': '-135.5083165',
      'LSADC': '00',
      'MTFCC': 'G4000',
      'NAME': 'Colorado',
      'OBJECTID': 27,
      'OID': 2749086215995,
      'REGION': '4',
      'STATE': '08',
      'STATENS': '01779779',
      'STUSAB': 'CO'}]},
   'matchedAddress': '1 E REVEREND AVE, BOULDER, CO, 88090',
   'tigerLine': {'side': 'L', 'tigerLineId': '177330882'}}]]
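
Note that in the first record above, the 'Census Tracts' layer holds only a 'status' entry rather than the usual fields, so a direct lookup such as ['BASENAME'] on it would raise a KeyError. Below is a minimal sketch of a defensive lookup. The helper `first_layer_record` is hypothetical, the sample is a trimmed-down version of the data, and treating a 'status' key as the marker of a failed layer is an assumption based only on this one record:

```python
def first_layer_record(result, layer):
    """Return the first record of a geography layer, or {} if absent or errored.

    Hypothetical helper, not part of any library.
    """
    records = result.get("geographies", {}).get(layer, [])
    # Assumption: a record carrying a 'status' key signals a failed layer query.
    if not records or "status" in records[0]:
        return {}
    return records[0]


sample = {
    "geographies": {
        "2010 Census Blocks": [{"BASENAME": "4003", "BLOCK": "4003"}],
        "Census Tracts": [{"status": "Layer query encountered an error"}],
    }
}

block = first_layer_record(sample, "2010 Census Blocks")
tract = first_layer_record(sample, "Census Tracts")
print(block.get("BASENAME"))  # 4003
print(tract.get("BASENAME"))  # None
```

Using `.get` on the returned dict yields `None` for missing fields, which pandas will treat as a missing value.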

Addendum to the original post

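I am trying to extract a few more variables from another part of the JSON. They are all in the '2010 Census Blocks' part of the tree. With the following code (adapted from the code you shared with me), I can print the parts I need: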

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        d = i['addressComponents']
        e = i['geographies']
        for w in e:
            g = e['2010 Census Blocks']
            print(g)

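That prints all the additional parts of the tree that I want. But when I try to integrate this into the part that extracts the variables and appends them to the DataFrame, I get the same TypeError as before.

This is my code: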

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        d = i['addressComponents']
        e = i['geographies']
        for w in e:
            g = e['2010 Census Blocks']
            new_result = pd.DataFrame({
                "fromAddress":[d['fromAddress']],
                "streetName":[d['streetName']],
                "suffixType":[d['suffixType']],
                "state":[d['state']],
                "city":[d['city']],
                "zip":[d['zip']],
                "BASENAME":[g['BASENAME']],
                "CENTLAT":[g['CENTLAT']], 
                "COUNTY":[g['COUNTY']], 
                "GEOID":[g['GEOID']], 
                "NAME":[g['NAME']], 
                "BLKGRP":[g['BLKGRP']], 
                "BLOCK":[g['BLOCK']] 
            })
            emptydata = emptydata.append(new_result)

2 Answers

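The problem here is the depth of the nesting: your for loop does not reach the inner level. Your output is a list of lists, each of which contains a nested dictionary. When you iterate over geo_set only one level deep, p['addressComponents'] fails because p is a list of nested dictionaries, not the dictionary you expected. You need to iterate over p as well to reach the dictionary i, which contains the key 'addressComponents' and everything else you want to retrieve. The '2010 Census Blocks' entry is itself a list, so its first element is taken with [0]: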

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        add_comp = i['addressComponents']
        census_block = i['geographies']['2010 Census Blocks'][0]
        new_result = pd.DataFrame({
            "fromAddress":[add_comp['fromAddress']],
            "streetName":[add_comp['streetName']],
            "suffixType":[add_comp['suffixType']],
            "state":[add_comp['state']],
            "city":[add_comp['city']],
            "zip":[add_comp['zip']],
            "BASENAME": [census_block['BASENAME']],
            "CENTLAT": [census_block['CENTLAT']],
            "COUNTY": [census_block['COUNTY']],
            "GEOID": [census_block['GEOID']],
            "NAME": [census_block['NAME']],
            "BLKGRP": [census_block['BLKGRP']],
            "BLOCK": [census_block['BLOCK']]
        })
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        emptydata = pd.concat([emptydata, new_result])

Output of emptydata (column order may differ depending on the pandas version):

  BASENAME BLKGRP BLOCK      CENTLAT COUNTY            GEOID        NAME  \
0     4003      4  4003  +43.7156677    031  080300028024003  Block 4113   
0     4003      4  4003  +43.7156677    033  080330028024113  Block 4233   

      city fromAddress state streetName suffixType    zip  
0  BOULDER           1    CO   REVEREND        AVE  80211  
0   DENVER           1    CO   REVEREND        AVE  80209

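For reference, errors like this are straightforward to debug. The TypeError: list indices must be integers or slices, not str that you received is a good hint that an indexing operation went wrong: something that is a list is being indexed with a string. Indexing and slicing use the [] syntax, and what else uses the same syntax? Dictionary key lookups, such as p['addressComponents']. If you had tried: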

[code not preserved in the source]

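you would have received the same error. At that point you have narrowed down where the error comes from, and you can narrow it down further by stepping through the data one level at a time.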
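
As an illustration of that point, here is a minimal sketch that reproduces the error on a tiny made-up sample with the same list-of-lists shape as geo_set (the sample data is invented for the demonstration):

```python
# Tiny made-up sample with the same shape as geo_set: a list of lists of dicts.
geo_set_sample = [[{"addressComponents": {"city": "BOULDER"}}]]

p = geo_set_sample[0]  # p is a list, not a dict
try:
    p["addressComponents"]
except TypeError as exc:
    message = str(exc)
    print(message)

# Indexing one level deeper reaches the dictionary.
city = p[0]["addressComponents"]["city"]
print(city)
```

Checking `type(p)` at each level is usually enough to see where a list sits where a dictionary was expected.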


Alternative:

If you would rather not have so many repeated string literals in your code, here is a dictionary-driven approach:

[code not preserved in the source]

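The output is the same, and you do not end up creating so many temporary DataFrame objects. One caveat, though, is that the DataFrame setup is now less readable.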
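
One way such a dictionary-driven approach might look, as a sketch only and not necessarily what the answer originally showed: the mapping `FIELDS` and the helper `build_rows` are hypothetical names, and the sample record is a trimmed-down version of the data in the question. Each row is collected as a plain dict and the DataFrame is built once at the end.

```python
import pandas as pd

# Hypothetical mapping: which keys to pull from which part of each record.
FIELDS = {
    "addressComponents": ["fromAddress", "streetName", "suffixType",
                          "state", "city", "zip"],
    "2010 Census Blocks": ["BASENAME", "CENTLAT", "COUNTY", "GEOID",
                           "NAME", "BLKGRP", "BLOCK"],
}


def build_rows(geo_set):
    """Flatten the list-of-lists of result dicts into one dict per row."""
    rows = []
    for results in geo_set:
        for result in results:
            sources = {
                "addressComponents": result["addressComponents"],
                "2010 Census Blocks":
                    result["geographies"]["2010 Census Blocks"][0],
            }
            rows.append({key: sources[part][key]
                         for part, keys in FIELDS.items() for key in keys})
    return rows


geo_set_sample = [[{
    "addressComponents": {"fromAddress": "1", "streetName": "REVEREND",
                          "suffixType": "AVE", "state": "CO",
                          "city": "BOULDER", "zip": "80211"},
    "geographies": {"2010 Census Blocks": [{
        "BASENAME": "4003", "CENTLAT": "+43.7156677", "COUNTY": "031",
        "GEOID": "080300028024003", "NAME": "Block 4113",
        "BLKGRP": "4", "BLOCK": "4003"}]},
}]]

df = pd.DataFrame(build_rows(geo_set_sample))
print(df.shape)  # (1, 13)
```

Adding another variable then means adding one string to `FIELDS` rather than another line to the DataFrame constructor.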

Again, keep track of which levels of your data are lists and which are dictionaries, and iterate accordingly.
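
To make that bookkeeping easier, a small inspection helper can summarise the container type at each level. The function `describe` below is a hypothetical helper written for this illustration, and it follows only the first element of each list:

```python
def describe(obj, depth=0):
    """Return lines summarising the container type at each nesting level.

    Hypothetical inspection helper; only the first list element is followed.
    """
    indent = "  " * depth
    if isinstance(obj, list):
        lines = [f"{indent}list[{len(obj)}]"]
        if obj:
            lines += describe(obj[0], depth + 1)
        return lines
    if isinstance(obj, dict):
        lines = [f"{indent}dict keys={sorted(obj)}"]
        for key in sorted(obj):
            # Recurse only into nested containers, not scalar values.
            if isinstance(obj[key], (list, dict)):
                lines.append(f"{indent}  [{key!r}]")
                lines += describe(obj[key], depth + 2)
        return lines
    return [f"{indent}{type(obj).__name__}"]


# Made-up sample with the same outer shape as geo_set.
sample = [[{"addressComponents": {"city": "BOULDER"}}]]
lines = describe(sample)
print("\n".join(lines))
```

The two `list[...]` lines at the top of the output correspond to the two loops needed before any string key can be used.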

You can simply do:

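# p is a list, so iterate over it; each i is a dict holding 'addressComponents'
emptydata = pd.DataFrame([{
        "fromAddress": i['addressComponents']['fromAddress'],
        "streetName": i['addressComponents']['streetName'],
        "suffixType": i['addressComponents']['suffixType'],
        "state": i['addressComponents']['state'],
        "city": i['addressComponents']['city'],
        "zip": i['addressComponents']['zip']
    } for p in geo_set for i in p])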
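
Another option, offered as a sketch rather than as part of either answer, is `pandas.json_normalize`, which flattens nested dictionaries into dotted column names. The sample below is a trimmed-down, made-up record set with the same shape as geo_set. List-valued fields such as the geography layers would stay as lists in their cells unless handled separately.

```python
import pandas as pd

# Made-up sample with the same list-of-lists shape as geo_set.
geo_set_sample = [
    [{"addressComponents": {"city": "BOULDER", "zip": "80211"},
      "coordinates": {"x": -135.98743, "y": 43.714783}}],
    [{"addressComponents": {"city": "DENVER", "zip": "80209"},
      "coordinates": {"x": -135.98743, "y": 43.714783}}],
]

# Flatten the outer list of lists into a single list of result dicts.
records = [result for results in geo_set_sample for result in results]

# Nested dicts become columns such as 'addressComponents.city'.
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

The dotted names can be shortened afterwards with `DataFrame.rename` if plain column names are preferred.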
