Loading large JSON files with pandas
I have more than 500 very large JSON files, each about 400 MB compressed (roughly 3 GB uncompressed). I'm currently processing the data with Python 2.7's standard json library, but it takes far too long, and I believe json.loads() is the main time sink. I'm considering using Python's pandas library to load the data from the gzip files and do the analysis.
I've only just heard of pandas, so I'm not sure it's the right tool. My main question is: would pandas give a noticeable speed improvement?
Also, I could of course parallelize the work to improve throughput, but it still seems slow to me.
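For reference, the kind of parallelization I mean is roughly one worker per file; process_file below is a hypothetical placeholder for the real parsing work:

import gzip
from multiprocessing import Pool

def process_file(path):
    # Placeholder: in practice this would do the json.loads() work
    # for one .json.gz file; here it just counts lines.
    count = 0
    with gzip.open(path, "rb") as f:
        for _ in f:
            count += 1
    return count

if __name__ == "__main__":
    paths = ["tweets-000.json.gz", "tweets-001.json.gz"]  # 500+ files in practice
    pool = Pool(processes=4)   # roughly one worker per CPU core
    counts = pool.map(process_file, paths)
    pool.close()
    pool.join()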
Finally, would it help the later analysis to read the data with gzip.open(), convert the JSON into dictionaries with json.loads(), and then store them in sqlite3?
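A minimal sketch of what I mean, assuming each file is newline-delimited JSON (one tweet per line); the table layout, field selection, and batch size are just placeholder choices:

import gzip
import json
import sqlite3

def load_into_sqlite(gz_path, db_path="tweets.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS tweets "
                 "(id INTEGER PRIMARY KEY, created_at TEXT, text TEXT, lang TEXT)")
    batch = []
    with gzip.open(gz_path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line.decode("utf-8"))
            if "delete" in entry:   # deletion notices carry no tweet body
                continue
            batch.append((entry["id"], entry["created_at"],
                          entry["text"], entry["lang"]))
            if len(batch) >= 10000:  # insert in batches to keep memory flat
                conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", batch)
                conn.commit()
                batch = []
    if batch:
        conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", batch)
        conn.commit()
    conn.close()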
A sample JSON entry:
{"created_at":"Sun Dec 01 01:19:00 +0000 2013","id":406955558441193472,"id_str":"406955558441193472","text":"Todo va a estar bn :D","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003eTwitter for BlackBerry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":483470963,"id_str":"483470963","name":"katheryn Rodriguez","screen_name":"katheryn_93","location":"","url":null,"description":"No pretendo ser nadie mas y no soy perfecta lo se, tengo muchos errores tambi\u00e9n lo se pero me acepto y me amo como soy.","protected":false,"followers_count":71,"friends_count":64,"listed_count":0,"created_at":"Sun Feb 05 02:04:16 +0000 2012","favourites_count":218,"utc_offset":-21600,"time_zone":"Central Time (US & Canada)","geo_enabled":true,"verified":false,"statuses_count":10407,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"DBE9ED","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/483470963\/1385144720","profile_link_color":"9D1DCF","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E6F6F9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"es"}
Occasionally you will see a JSON entry like this:
{"delete":{"status":"id":380315814080937984,"user_id":318430801,"id_str":"380315814080937984","user_id_str":"318430801"}}}
1 Answer
3 GB of JSON loaded into nested Python dictionaries becomes very large, possibly several times the size of the original file, so it will consume a lot of memory. Watch how memory usage grows while you load these files; you will probably find that your machine starts using swap space.
You need to either parse the JSON files line by line (if they are formatted one object per line) or split the files into smaller pieces.
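A minimal sketch of the line-by-line approach, assuming the files are newline-delimited JSON and that only a handful of fields from the sample entry above are actually needed:

import gzip
import json
import pandas as pd

def frame_from_gzip(gz_path, keep=("id", "created_at", "text", "lang")):
    # Stream one gzipped file line by line and keep only a few fields,
    # so the resulting DataFrame is far smaller than the full nested dicts.
    records = []
    with gzip.open(gz_path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line.decode("utf-8"))
            if "delete" in entry:   # deletion notices have no tweet body
                continue
            records.append({k: entry.get(k) for k in keep})
    return pd.DataFrame.from_records(records, columns=list(keep))

If a recent pandas version is available, it can also read newline-delimited JSON in chunks directly, e.g. pd.read_json(path, lines=True, chunksize=100000, compression="gzip"), which keeps only one chunk in memory at a time.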