Loading huge JSON files with pandas


I have 500+ huge JSON files, each about 400 MB in compressed form (3 GB uncompressed). I am using the standard json library in Python 2.7 to process the data, and it is taking far too long; I suspect the JSON parsing itself is the main time sink. I am considering using pandas to load the data from the gzip files and run the analysis there.

I have only just heard of pandas and am not sure it is the right tool for the job. My concern is: will using pandas actually give me a speed-up?
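For reference, pandas can stream line-delimited JSON in chunks via read_json (this requires a pandas version with lines=True and chunksize support, i.e. 0.21+). A minimal sketch, assuming the files are one JSON object per line (NDJSON, the usual layout of Twitter streaming dumps) and using a hypothetical file name:

    import pandas as pd

    # Hypothetical file name; assumes one JSON object per line (NDJSON).
    reader = pd.read_json("tweets.json.gz", lines=True,
                          compression="gzip", chunksize=10000)

    for chunk in reader:                 # each chunk is a DataFrame
        # Example analysis: count Spanish-language tweets per chunk.
        print((chunk["lang"] == "es").sum())

Note that pandas still has to decode the JSON, so this mainly helps with memory, not necessarily with raw parsing speed.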

NB: I can of course parallelize the work to gain speed, but everything still feels quite sluggish.

Also: would reading the data with gzip.open(), converting the JSON into dicts with json.loads(), and then storing everything in sqlite3 help with the subsequent analysis?
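A minimal sketch of that pipeline, assuming line-delimited files and a deliberately simplified, hypothetical table layout:

    import gzip
    import json
    import sqlite3

    conn = sqlite3.connect("tweets.db")      # hypothetical database name
    conn.execute("CREATE TABLE IF NOT EXISTS tweets "
                 "(id INTEGER PRIMARY KEY, created_at TEXT, lang TEXT, text TEXT)")

    batch = []
    with gzip.open("tweets.json.gz", "rb") as f:
        for line in f:
            entry = json.loads(line)
            if "delete" in entry:             # skip deletion notices
                continue
            batch.append((entry["id"], entry["created_at"],
                          entry["lang"], entry["text"]))
            if len(batch) >= 10000:           # insert in batches, not row by row
                conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?,?,?,?)",
                                 batch)
                batch = []
    if batch:                                 # flush the remainder
        conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?,?,?,?)", batch)
    conn.commit()
    conn.close()

Batching the inserts and committing once keeps sqlite3 fast; whether the database speeds up the later analysis depends on whether your queries can work from the columns you extracted.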

Sample JSON entry:

 {"created_at":"Sun Dec 01 01:19:00 +0000 2013","id":406955558441193472,"id_str":"406955558441193472","text":"Todo va a estar bn :D","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003eTwitter for BlackBerry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":483470963,"id_str":"483470963","name":"katheryn Rodriguez","screen_name":"katheryn_93","location":"","url":null,"description":"No pretendo ser nadie mas y no soy perfecta lo se, tengo muchos errores tambi\u00e9n lo se pero me acepto y me amo como soy.","protected":false,"followers_count":71,"friends_count":64,"listed_count":0,"created_at":"Sun Feb 05 02:04:16 +0000 2012","favourites_count":218,"utc_offset":-21600,"time_zone":"Central Time (US & Canada)","geo_enabled":true,"verified":false,"statuses_count":10407,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"DBE9ED","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/483470963\/1385144720","profile_link_color":"9D1DCF","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E6F6F9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"es"}

Occasionally you may come across a JSON entry of this kind:

{"delete":{"status":"id":380315814080937984,"user_id":318430801,"id_str":"380315814080937984","user_id_str":"318430801"}}}


1 Answer

A 3 GB JSON file, stored in Python as nested dicts, becomes enormous, very likely several times that size, and therefore needs a great deal of memory. Watch how memory usage grows while one of these files is loading; you will most likely see your machine start to swap.

You need to either parse each line as JSON on its own (if the files are line-delimited), or split the files into smaller chunks.
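A minimal sketch of the line-by-line approach (file name hypothetical, assuming one JSON object per line); peak memory stays at roughly one record, no matter how large the file is:

    import gzip
    import json

    lang_counts = {}

    # Stream the gzipped file; only one decoded record is held in memory
    # at any time, so memory use stays flat regardless of file size.
    with gzip.open("tweets.json.gz", "rb") as f:
        for line in f:
            entry = json.loads(line)
            if "delete" in entry:
                continue
            lang = entry.get("lang", "unknown")
            lang_counts[lang] = lang_counts.get(lang, 0) + 1

    print(lang_counts)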
