Pandas semi-structured JSON data frame to simple Pandas dataframe

1 answer

User

#1 · Posted 2024-04-27 19:32:23

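With the input string above stored in a variable named 'data', this Python + pyparsing code can make sense of it. Unfortunately, the stuff to the right of the fourth '|' isn't really JSON. Fortunately, it is formatted well enough that it can be parsed without undue discomfort. See the comments embedded in the program below: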

from pyparsing import *
from datetime import datetime

# for the most part, we suppress punctuation - it's important at parse time
# but just gets in the way afterwards
LBRACE,RBRACE,COLON,DBLQ,LBRACK,RBRACK = map(Suppress, '{}:"[]')
DBLQ2 = DBLQ + DBLQ

# define some scalar value expressions, including parse-time conversion parse actions
realnum = Regex(r'[+-]?\d+\.\d*').setParseAction(lambda t:float(t[0]))
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
timestamp = Regex(r'""\d{4}-\d{2}-\d{2}T\d{2}:\d{2}""')
timestamp.setParseAction(lambda t: datetime.strptime(t[0][2:-2],'%Y-%m-%dT%H:%M'))
string_value = QuotedString('""')

# define our base key ':' value expression; use a Forward() placeholder
# for now for value, since these things can be recursive
key = Optional(DBLQ2) + Word(alphas, alphanums+'_') + DBLQ2
value = Forward()
key_value = Group(key + COLON + value)

# objects can be values too - use the Dict class to capture keys as field names
obj = Group(Dict(LBRACE + OneOrMore(key_value) + RBRACE))
objlist = (LBRACK + ZeroOrMore(obj) + RBRACK)

# define expression for previously-declared value, using <<= operator
value <<= timestamp | string_value | realnum | integer | obj | Group(objlist)

# the outermost objects are enclosed in "s, and list of them can be given with '|' delims
quotedObj = DBLQ + obj + DBLQ
obsList = delimitedList(quotedObj, delim='|')
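
Since obsList only describes the quoted objects, the text handed to it has to be the part of the line to the right of the fourth '|'. A minimal sketch of isolating that part with the standard library follows. The sample line is made up for illustration (the four leading field values and both payload objects are hypothetical, not the question's actual input):

```python
import json

# Hypothetical sample line: four ordinary fields, then two quoted
# pseudo-JSON objects that are themselves separated by '|'.
data = 'f1|f2|f3|f4|"{""currency"":""EUR""}"|"{""currency"":""USD""}"'

# Split on the first four '|' only, so the '|' between the quoted
# objects stays inside the payload.
*leading, payload = data.split('|', 4)

# The doubled quotes mean the payload is not valid JSON.
try:
    json.loads(payload)
    is_json = True
except json.JSONDecodeError:
    is_json = False

print(leading)
print(payload)
print(is_json)
```

Passing a maxsplit of 4 is the design choice that matters here: a plain split('|') would also cut the payload apart at the delimiter between objects.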

Now apply this parser to your 'data':

[code block missing from the source]

This gives:
[['currency', 'EUR'], ['item_id', '143'], ['type', 'FLIGHT'], ['name', 'PAR-FEZ'], ['price', 1111], ['origin', 'PAR'], ['destination', 'FEZ'], ['merchant', 'GOV'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]]]]
- currency: EUR
- destination: FEZ
- flight_segment:
  [0]:
    [['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]
    - arrival_date_time: 2015-08-02 09:05:00
    - carrier: AT
    - departure_date_time: 2015-08-02 07:20:00
    - destination: FEZ
    - f_class: ECONOMY
    - origin: ORY
- flight_type: OW
- item_id: 143
- merchant: GOV
- name: PAR-FEZ
- origin: PAR
- price: 1111
- type: FLIGHT

[['type', 'FLIGHT'], ['name', 'FI_ORY-OUD'], ['item_id', 'FLIGHT'], ['currency', 'EUR'], ['price', 111], ['origin', 'ORY'], ['destination', 'OUD'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]]]]
- currency: EUR
- destination: OUD
- flight_segment:
  [0]:
    [['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]
    - arrival_date_time: 2015-08-02 15:30:00
    - carrier: AT
    - departure_date_time: 2015-08-02 13:55:00
    - destination: OUD
    - f_class: ECONOMIC_DISCOUNTED
    - flight_number: AT625
    - origin: ORY
- flight_type: OW
- item_id: FLIGHT
- name: FI_ORY-OUD
- origin: ORY
- price: 111
- type: FLIGHT
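Note that values that are not strings (integers, timestamps, etc.) have already been converted to Python types. Since the field names are saved as dict keys, you can access fields by name, as in:

res[0].currency
res[0].price
res[0].destination
res[0].flight_segment[0].origin
len(res[0].flight_segment) # gives how many segments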
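
Both points (parse-time conversion and access by field name) can be tried on a much smaller grammar. The following is a minimal sketch, not the answer's parser: the record grammar and its plain 'key: value' input are made up for illustration, and it assumes pyparsing is installed.

```python
import pyparsing as pp

# Integers are converted by a parse action; other values stay strings.
integer = pp.Regex(r"[+-]?\d+").setParseAction(lambda t: int(t[0]))
word = pp.Word(pp.alphas, pp.alphanums + "_")

# Hypothetical 'key: value' pairs; the ':' is suppressed after parsing.
key = pp.Word(pp.alphas, pp.alphanums + "_")
pair = pp.Group(key + pp.Suppress(":") + (integer | word))

# Dict uses the first token of each group as a results name.
record = pp.Dict(pp.OneOrMore(pair))

res = record.parseString("currency: EUR price: 1111 origin: PAR", parseAll=True)

print(res.currency)      # attribute-style access by field name
print(res["origin"])     # dict-style access works as well
print(type(res.price))   # the parse action already produced an int
```

Attribute access only works for keys that do not clash with existing ParseResults attributes or methods (for example keys, items or dump); for such names, the dict-style res["..."] form is the safer choice.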
