如何在Python中解析包含非拉丁字符的JSON文件,并将其输出为一个列表的列表?

2024-05-23 19:43:13 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个json文件,看起来像这样:

{"id": 1, "text": "\"Sathon, Bangkok 10120, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[25, 33, "Country"], [18, 23, "PostCode"], [2, 8, "District"], [10, 17, "Province"]]}
{"id": 2, "text": "\"8/89 ซอยหมู่บ้านหนองแก Tambon Nong Kae, Amphoe Hua Hin, Chang Wat Prachuap Khiri Khan 77110, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[6, 23, "Plot"], [1, 5, "Plot"], [24, 30, "SubDistrictKeyword"], [31, 39, "SubDistrict"], [41, 47, "DistrictKeyword"], [48, 55, "District"], [57, 66, "ProvinceKeyword"], [67, 86, "Province"], [87, 92, "PostCode"], [94, 102, "Country"]]}
{"id": 3, "text": "\"1291, 1293 Sutthisan Winitchai Rd, Khwaeng Din Daeng, Khet Din Daeng, Krung Thep Maha Nakhon 10400, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 11, "HouseNumber"], [12, 31, "Street"], [32, 35, "StreetKeyword"], [36, 43, "SubDistrictKeyword"], [55, 59, "DistrictKeyword"], [60, 70, "DistrictKeyword"], [94, 100, "PostCode"], [44, 54, "SubDistrict"], [101, 109, "Country"], [71, 93, "Province"]]}
{"id": 4, "text": "\"23, 21 ถนน พระราม ๒ Khwaeng Bang Mot, Khet Chom Thong, Krung Thep Maha Nakhon 10150, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[8, 20, "Street"], [21, 28, "SubDistrictKeyword"], [1, 7, "HouseNumber"], [29, 38, "SubDistrict"], [39, 43, "DistrictKeyword"], [44, 55, "District"], [56, 78, "Province"], [79, 85, "PostCode"], [86, 94, "Country"]]}
{"id": 5, "text": "\"Bang Na-Trat Frontage Rd, Khwaeng Bang Na, Khet Bang Na, Krung Thep Maha Nakhon 10260, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 22, "Street"], [23, 26, "StreetKeyword"], [27, 34, "SubDistrictKeyword"], [35, 43, "SubDistrict"], [44, 48, "DistrictKeyword"], [49, 57, "District"], [58, 80, "Province"], [81, 87, "Plot"], [88, 96, "Country"]]}
{"id": 6, "text": "\"Florida, USA\"", "meta": {}, "annotation_approver": null, "labels": [[1, 9, "City"], [10, 13, "Country"]]}
{"id": 7, "text": "\"Thapae Rd, Amphoe Mueang Chiang Mai, Chang Wat Chiang Mai 50300, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 7, "Street"], [8, 11, "StreetKeyword"], [12, 18, "DistrictKeyword"], [19, 37, "District"], [38, 47, "ProvinceKeyword"], [48, 58, "Province"], [59, 65, "PostCode"], [66, 74, "Country"]]}
{"id": 8, "text": "\"Bangkok, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 9, "City"], [10, 18, "Country"]]}
{"id": 9, "text": "\"31/3 Beach, Ao Nang, Krabi, 81180, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 5, "HouseNumber"], [6, 12, "Street"], [13, 21, "City"], [22, 28, "Province"], [29, 35, "PostCode"], [36, 44, "Country"]]}
{"id": 10, "text": "\"Mueang Suphan Buri District, Suphan Buri, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 19, "District"], [20, 29, "DistrictKeyword"], [30, 42, "Province"], [43, 51, "Country"]]}
{"id": 11, "text": "\"Mueang Suphan Buri District, Suphan Buri, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 19, "District"], [20, 29, "DistrictKeyword"], [30, 42, "Province"], [43, 51, "Country"]]}
{"id": 12, "text": "\"Mueang Suphan Buri District, Suphan Buri, Thailand\"", "meta": {}, "annotation_approver": null, "labels": []}
{"id": 13, "text": "\"1 ซอย 20 ถนน สุขุมวิท Khwaeng Khlong Toei, Khet Khlong Toei, Krung Thep Maha Nakhon 10110, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 2, "HouseNumber"], [3, 22, "Street"], [23, 30, "SubDistrictKeyword"], [31, 43, "District"], [44, 48, "DistrictKeyword"], [49, 61, "District"], [62, 84, "Province"], [85, 91, "PostCode"], [92, 100, "Country"]]}
{"id": 14, "text": "\"Ekkamai Rd, Phra Khanong Nuea, Khet Watthana, Krung Thep Maha Nakhon 10110, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 8, "Street"], [9, 12, "StreetKeyword"], [13, 31, "District"], [32, 36, "DistrictKeyword"], [37, 46, "District"], [47, 69, "Province"], [70, 76, "PostCode"], [77, 85, "Country"]]}
{"id": 15, "text": "\"587, 589 , 589/7-9 Fashion Island Thanon Ram Intra, Khan Na Yao, Bangkok 10230, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 19, "HouseNumber"], [20, 52, "Street"], [53, 65, "District"], [66, 73, "Province"], [74, 80, "PostCode"], [81, 89, "Country"]]}
{"id": 16, "text": "\"Nong Prue, Pattaya City, Bang Lamung District, Chon Buri 20150, Thailand\"", "meta": {}, "annotation_approver": null, "labels": []}
{"id": 17, "text": "\"76/1 Maharaj Rd, Tambon Pak Nam, Amphoe Mueang Krabi, Chang Wat Krabi 81000, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 5, "HouseNumber"], [6, 13, "Street"], [14, 17, "StreetKeyword"], [18, 24, "SubDistrictKeyword"], [25, 33, "SubDistrict"], [34, 40, "DistrictKeyword"], [41, 54, "District"], [55, 64, "ProvinceKeyword"], [65, 70, "Province"], [71, 77, "PostCode"], [78, 86, "Country"]]}
{"id": 18, "text": "\"622 Emporium Tower 23rd Floor, Sukhumvit 24 Road, Klongton, Klongtoey, Bangkok 10110, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 4, "HouseNumber"], [5, 50, "Street"], [51, 71, "SubDistrict"], [72, 79, "Province"], [80, 86, "PostCode"], [87, 95, "Country"]]}
{"id": 19, "text": "\"Lam Luk Ka District, Pathum Thani, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 11, "District"], [12, 21, "DistrictKeyword"], [22, 35, "Province"], [36, 44, "Country"]]}
{"id": 20, "text": "\"Samut Prakan, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 14, "Province"], [15, 23, "Country"]]}
{"id": 21, "text": "\"607 Phet Kasem Rd, Khwaeng Bang Wa, Khet Phasi Charoen, Krung Thep Maha Nakhon 10160, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 4, "HouseNumber"], [5, 15, "Street"], [16, 19, "StreetKeyword"], [20, 27, "SubDistrictKeyword"], [28, 36, "SubDistrict"], [37, 41, "DistrictKeyword"], [42, 56, "District"], [57, 79, "Province"], [80, 86, "PostCode"], [87, 95, "Country"]]}
{"id": 22, "text": "\"4th Floor , Central Chidlom 1027 Phloen Chit Rd, Lumphini, Pathum Wan District, Bangkok 10331, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 45, "Street"], [46, 49, "StreetKeyword"], [50, 59, "SubDistrict"], [60, 70, "District"], [71, 80, "DistrictKeyword"], [81, 88, "Province"], [89, 95, "PostCode"], [96, 104, "Country"]]}
{"id": 23, "text": "\"233 S Sathon Rd, Khwaeng Yan Nawa, Khet Sathon, Krung Thep Maha Nakhon 10120, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 6, "HouseNumber"], [7, 17, "Street"], [18, 25, "SubDistrictKeyword"], [26, 35, "SubDistrict"], [36, 40, "DistrictKeyword"], [41, 48, "District"], [49, 71, "Province"], [72, 78, "PostCode"], [79, 87, "Country"]]}
{"id": 24, "text": "\"Pa Tong, Kathu District, Phuket 83150, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 9, "City"], [10, 15, "District"], [16, 25, "DistrictKeyword"], [26, 32, "Province"], [33, 39, "PostCode"], [40, 48, "Country"]]}
{"id": 25, "text": "\"622 Sukhumvit Rd, Khwaeng Khlong Tan, Khet Khlong Toei, Krung Thep Maha Nakhon 10110, Thailand\"", "meta": {}, "annotation_approver": null, "labels": [[1, 4, "HouseNumber"], [5, 14, "Street"], [15, 18, "StreetKeyword"], [19, 26, "SubDistrictKeyword"], [27, 38, "SubDistrict"], [39, 43, "DistrictKeyword"], [44, 56, "District"], [57, 79, "Province"], [80, 86, "PostCode"], [87, 95, "Country"]]}

我需要解析这个文件并提供一个列表列表作为输出。 输出:

[('ซอยหมู่บ้านหนองแก', 'Plot'), ('8', 'Plot'), ('/', 'Plot'), ('89', 'Plot'), ('Tambon', 'SubDistrictKeyword'), ('Nong', 'SubDistrict'), ('Kae', 'SubDistrict'), ('Amphoe', 'DistrictKeyword'), ('Hua', 'District'), ('Hin', 'District'), ('Chang', 'ProvinceKeyword'), ('Wat', 'ProvinceKeyword'), ('Prachuap', 'Province'), ('Khiri', 'Province'), ('Khan', 'Province'), ('77110', 'PostCode'), ('Thailand', 'Country')]

每个元组由一个“标记”-“标签”对组成。标记是dict中包含的文本的子字符串,子字符串开始和结束的索引在“labels”键中提到

我试过:

import json
data=json.load('sample_json')

但是,我得到了这个错误:AttributeError:'str'对象没有属性'read'

我想这是因为非拉丁字符,但我找不到任何东西来解决它


Tags: textidstreetlabelsannotationcountrynullmeta
3条回答

首先,json.load接受文件流,而不是str。您必须使用json.loads,或者将打开的文件作为参数传递。
您还需要指定读取非ascii符号的编码

with open('test.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

如果sample_jsonstr而不是file句柄,那么您需要json.loads而不是json.load(注意s,对于string(或者至少我是这样记得区别的))

您需要先打开文件:

In [1]: import json                                                             
In [2]: f = open("sample_json")                                                 
In [3]: data = json.load(f)                                                     
In [4]: data                                                                    
Out[4]: 
{'id': 2,
 'text': '"8/89 ซอยหมู่บ้านหนองแก Tambon Nong Kae, Amphoe Hua Hin, Chang Wat Prachuap Khiri Khan 77110, Thailand"',
 'meta': {},
 'annotation_approver': None,
 'labels': [[6, 23, 'Plot'],
  [1, 5, 'Plot'],
  [24, 30, 'SubDistrictKeyword'],
  [31, 39, 'SubDistrict'],
  [41, 47, 'DistrictKeyword'],
  [48, 55, 'District'],
  [57, 66, 'ProvinceKeyword'],
  [67, 86, 'Province'],
  [87, 92, 'PostCode'],
  [94, 102, 'Country']]}

相关问题 更多 >