Loading large JSON files with pandas
I have more than 500 very large JSON files, each about 400 MB compressed (roughly 3 GB uncompressed). I'm currently processing the data with Python 2.7's standard json library, but it takes far too long, and I believe json.loads() is the main time sink. I'm considering using Python's pandas library to load the data from the gzip files and do the analysis.
I've only just heard of pandas, so I'm not sure it's the right tool. My main question is: would pandas give a noticeable speed improvement?
Also, I could of course parallelize the work to improve throughput, but it still seems slow to me.
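For reference, the kind of parallelization I mean is roughly one worker per file; process_file below is a hypothetical placeholder for the real parsing work:

import gzip
from multiprocessing import Pool

def process_file(path):
    # Placeholder: in practice this would do the json.loads() work
    # for one .json.gz file; here it just counts lines.
    count = 0
    with gzip.open(path, "rb") as f:
        for _ in f:
            count += 1
    return count

if __name__ == "__main__":
    paths = ["tweets-000.json.gz", "tweets-001.json.gz"]  # 500+ files in practice
    pool = Pool(processes=4)   # roughly one worker per CPU core
    counts = pool.map(process_file, paths)
    pool.close()
    pool.join()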
Finally, would it help the later analysis to read the data with gzip.open(), convert the JSON into dictionaries with json.loads(), and then store them in sqlite3?
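A minimal sketch of what I mean, assuming each file is newline-delimited JSON (one tweet per line); the table layout, field selection, and batch size are just placeholder choices:

import gzip
import json
import sqlite3

def load_into_sqlite(gz_path, db_path="tweets.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS tweets "
                 "(id INTEGER PRIMARY KEY, created_at TEXT, text TEXT, lang TEXT)")
    batch = []
    with gzip.open(gz_path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line.decode("utf-8"))
            if "delete" in entry:   # deletion notices carry no tweet body
                continue
            batch.append((entry["id"], entry["created_at"],
                          entry["text"], entry["lang"]))
            if len(batch) >= 10000:  # insert in batches to keep memory flat
                conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", batch)
                conn.commit()
                batch = []
    if batch:
        conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", batch)
        conn.commit()
    conn.close()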
A sample JSON entry:
{"created_at":"Sun Dec 01 01:19:00 +0000 2013","id":406955558441193472,"id_str":"406955558441193472","text":"Todo va a estar bn :D","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003eTwitter for BlackBerry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":483470963,"id_str":"483470963","name":"katheryn Rodriguez","screen_name":"katheryn_93","location":"","url":null,"description":"No pretendo ser nadie mas y no soy perfecta lo se, tengo muchos errores tambi\u00e9n lo se pero me acepto y me amo como soy.","protected":false,"followers_count":71,"friends_count":64,"listed_count":0,"created_at":"Sun Feb 05 02:04:16 +0000 2012","favourites_count":218,"utc_offset":-21600,"time_zone":"Central Time (US & Canada)","geo_enabled":true,"verified":false,"statuses_count":10407,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"DBE9ED","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/483470963\/1385144720","profile_link_color":"9D1DCF","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E6F6F9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"es"}
Occasionally you will see a JSON entry like this:
{"delete":{"status":"id":380315814080937984,"user_id":318430801,"id_str":"380315814080937984","user_id_str":"318430801"}}}
1 Answer
3 GB of JSON loaded into nested Python dictionaries becomes very large, possibly several times the size of the original file, so it will consume a lot of memory. Watch how memory usage grows while you load these files; you will probably find that your machine starts using swap space.
You need to either parse the JSON files line by line (if they are formatted one object per line) or split the files into smaller pieces.
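A minimal sketch of the line-by-line approach, assuming the files are newline-delimited JSON and that only a handful of fields from the sample entry above are actually needed:

import gzip
import json
import pandas as pd

def frame_from_gzip(gz_path, keep=("id", "created_at", "text", "lang")):
    # Stream one gzipped file line by line and keep only a few fields,
    # so the resulting DataFrame is far smaller than the full nested dicts.
    records = []
    with gzip.open(gz_path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line.decode("utf-8"))
            if "delete" in entry:   # deletion notices have no tweet body
                continue
            records.append({k: entry.get(k) for k in keep})
    return pd.DataFrame.from_records(records, columns=list(keep))

If a recent pandas version is available, it can also read newline-delimited JSON in chunks directly, e.g. pd.read_json(path, lines=True, chunksize=100000, compression="gzip"), which keeps only one chunk in memory at a time.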