Python:将文本加载为Python对象

1 投票
4 回答
4617 浏览
提问于 2025-04-16 03:29

我有一段文本需要加载:https://sites.google.com/site/iminside1/paste
我希望能把它转换成一个Python字典,不过其他类型的对象也可以。我试过用picklejsoneval,但都没有成功。你能帮我一下吗?
谢谢!
结果如下:

a = open("the_file", "r").read()

json.loads(a)
ValueError: Expecting property name: line 1 column 1 (char 1)

pickle.loads(a)
KeyError: '{'

eval(a)
File "<string>", line 19
from: {code: 'DME', airport: "Домодедово", city: 'Москва', country: 'Россия', terminal: ''},
    ^
SyntaxError: invalid syntax

4 个回答

1

到现在为止,在delnan的帮助下,以及我自己做了一些调查,我可以用eval把它加载到字典里:

pattern = r"\b(?P<word>\w+):"
x = re.sub(pattern, '"\g<word>":',open("the_file", "r").read())
y = x.replace("true", '"true"')
d = eval(y)

不过我还是在寻找更高效、可能更简单的解决办法……我不太喜欢出于某些原因使用“eval”。

3

如果你真的需要加载这些数据……(看看我的评论),你可能最好用正则表达式来添加缺失的引号。可以用类似 r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:" 的方式来找到需要加引号的部分,然后用 r"\'\1\'\:" 来替换(这是我随便想的,我还得先测试一下)。

编辑:在处理Python 3.1中的反向引用时遇到了一些麻烦,最后我用这些方法成功了:

>>> pattern = r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:"
>>> test = '{"foo": {bar: 1}}'
>>> repl = lambda match: '"{}":'.format(match.group(1))
>>> eval(re.sub(pattern, repl, test))
{'foo': {'bar': 1}}
4

这段内容几乎是直接从pyparsing的示例页面复制过来的:

# read text from web page
import urllib
page = urllib.urlopen("https://sites.google.com/site/iminside1/paste")
html = page.read()
page.close()

start = html.index("<pre>")+len("<pre>")+3 #skip over 3-byte header
end = html.index("</pre>")
text = html[start:end]
print text

# parse dict-like syntax    
from pyparsing import (Suppress, Regex, quotedString, Word, alphas, 
alphanums, oneOf, Forward, Optional, dictOf, delimitedList, Group, removeQuotes)

LBRACK,RBRACK,LBRACE,RBRACE,COLON,COMMA = map(Suppress,"[]{}:,")
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
string_ = Word(alphas,alphanums+"_") | quotedString.setParseAction(removeQuotes)
bool_ = oneOf("true false").setParseAction(lambda t: t[0]=="true")
item = Forward()

key = string_
dict_ = LBRACE - Optional(dictOf(key+COLON, item+Optional(COMMA))) + RBRACE
list_ = LBRACK - Optional(delimitedList(item)) + RBRACK
item << (real | integer | string_ | bool_ | Group(list_ | dict_ ))

result = item.parseString(text,parseAll=True)[0]
print result.data[0].dump()
print result.data[0].segments[0].dump(indent="  ")
print result.data[0].segments[0].flights[0].dump(indent="  -  ")
print result.data[0].segments[0].flights[0].flightLegs[0].dump(indent="  -  -  ")
for seg in result.data[6].segments:
    for flt in seg.flights:
        fltleg = flt.flightLegs[0]
        print "%(airline)s %(airlineCode)s %(flightNo)s" % fltleg,
        print "%s -> %s" % (fltleg["from"].code, fltleg["to"].code)

打印结果:

[['index', 0], ['serviceClass', '??????'], ['prices', [3504, ...
- eTicketing: true
- index: 0
- prices: [3504, 114.15000000000001, 89.769999999999996]
- segments: [[['indexSegment', 0], ['stopsCount', 0], ['flights', ... 
- serviceClass: ??????
  [['indexSegment', 0], ['stopsCount', 0], ['flights', [[['index', 0], ...
  - flights: [[['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ...
  - indexSegment: 0
  - stopsCount: 0
  -  [['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ['flight...
  -  - flightLegs: [[['flightNo', '309'], ['eTicketing', 'true'], ['air... 
  -  - index: 0
  -  - minAvailSeats: 9
  -  - stops: []
  -  - time: PT2H45M
  -  -  [['flightNo', '309'], ['eTicketing', 'true'], ['airplane', 'Boe... 
  -  -  - airline: ?????????
  -  -  - airlineCode: UN
  -  -  - airplane: Boeing 737-500
  -  -  - availSeats: 9
  -  -  - classCode: I
  -  -  - eTicketing: true
  -  -  - fareBasis: IPROW
  -  -  - flightClass: ECONOMY
  -  -  - flightNo: 309
  -  -  - from:   -  -  [['code', 'DME'], ['airport', '??????????'], ... 
  -  -    - airport: ??????????
  -  -    - city: ??????
  -  -    - code: DME
  -  -    - country: ??????
  -  -    - terminal: 
  -  -  - fromDate: 2010-10-15
  -  -  - fromTime: 10:40:00
  -  -  - time: 
  -  -  - to:   -  -  [['code', 'TXL'], ['airport', 'Berlin-Tegel'], ... 
  -  -    - airport: Berlin-Tegel
  -  -    - city: ??????
  -  -    - code: TXL
  -  -    - country: ????????
  -  -    - terminal: 
  -  -  - toDate: 2010-10-15
  -  -  - toTime: 11:25:00
airBaltic BT 425 SVO -> RIX
airBaltic BT 425 SVO -> RIX
airBaltic BT 423 SVO -> RIX
airBaltic BT 423 SVO -> RIX

编辑: 修正了分组,并扩展了输出内容,展示了如何通过索引(在列表中)或作为属性(在字典中)访问结果的各个关键字段。

撰写回答