Python bcolz: how to merge two ctables

11 votes
1 answer
1483 views
Asked 2025-04-20 15:12

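I've been playing with the bcolz in-memory compression example from this notebook.

So far I'm really amazed by this library. I think it is a great tool for those of us who want to load large files into a small amount of memory (nice work, Francesc, if you're reading this!).

I was wondering whether anyone has experience joining two ctables, the way pandas.merge() does, and how to do it efficiently in terms of both time and memory.

Thanks for sharing your ideas :-)!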

1 Answer

5

I got this working just in time. Many thanks to @mdurant for itertoolz! Here is some pseudocode, since the example I was actually using is far too ugly.

# Generic pandas version
import pandas as pd
df_new = pd.merge(df1, df2)


# Equivalent with itertoolz and bcolz
from toolz.itertoolz import join as joinz
import bcolz

# Convert the DataFrames to ctables
zdf1 = bcolz.ctable.fromdataframe(df1)
zdf2 = bcolz.ctable.fromdataframe(df2)

# Column 2 of df1 and column 1 of df2 are the columns to join on
# (positions 1 and 0 when counting from zero)
merged = list(joinz(1, zdf1.iter(), 0, zdf2.iter()))

# new_dtypes holds the dtypes of the fields you are using;
# in my case it was:
new_dtypes = '|S8,|S8,|S8,|S8,|S8'
# Each joined pair (a[0], a[1]) is concatenated into one flat row
zdf3 = bcolz.fromiter((a[0] + a[1] for a in merged), dtype=new_dtypes, count=len(merged))

Obviously there may be smarter ways to do this, and this example is not very specific, but it works and can serve as a base for others to build on.
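For anyone wondering what the join step is doing, here is a minimal pure-Python sketch of a hash join on tuples. It is only an illustration under my own assumptions: `hash_join` is a hypothetical helper, not part of toolz or bcolz, and the sample rows are made up. As far as I understand, toolz's `join` works along similar lines (the left sequence is indexed in memory and the right one is streamed), which is why it seems sensible to put the smaller table on the left.

```python
from collections import defaultdict


def hash_join(left_key, left_rows, right_key, right_rows):
    """Inner-join two iterables of tuples on the given column positions.

    Hypothetical helper for illustration only.
    """
    # Index the left rows by join key; the whole left side is held in memory.
    index = defaultdict(list)
    for row in left_rows:
        index[row[left_key]].append(row)
    # Stream the right rows; each one is looked up once and then discarded.
    for row in right_rows:
        for match in index.get(row[right_key], ()):
            # Tuple concatenation, the same trick as a[0] + a[1] above.
            yield match + row


# Made-up sample data: (user_id, sex) and (user_id, movie_id, rating)
users = [(1, "F"), (2, "M")]
ratings = [(1, 1193, 5), (2, 661, 3), (1, 914, 3), (3, 10, 4)]

joined = list(hash_join(0, users, 0, ratings))
for row in joined:
    print(row)
```

The rating for user 3 is dropped because there is no matching user, which is the inner-join behaviour that `pd.merge` has by default.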

Updated example, Oct 21, 7 PM Eastern

# Download the MovieLens data files from http://grouplens.org/datasets/movielens/
# I'm using the 1M dataset
import os
from time import time

import pandas as pd
from toolz.itertoolz import join as joinz
import bcolz

t0 = time()
dset = '/Path/To/Your/Data/'
udata = os.path.join(dset, 'users.dat') 
# users.dat in the 1M dataset is laid out as UserID::Gender::Age::Occupation::Zip-code
u_cols = ['user_id', 'sex', 'age', 'occupation', 'zip_code']
users = pd.read_csv(udata, sep='::', names=u_cols, engine='python')  # multi-character separators need the Python engine

rdata = os.path.join(dset, 'ratings.dat')
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(rdata, sep='::', names=r_cols, engine='python')  # multi-character separators need the Python engine

print ("Time for parsing the data: %.2f" % (time()-t0,)) 
#Time for parsing the data: 4.72

t0=time()
users_ratings = pd.merge(users,ratings)
print ("Time for merging the data: %.2f" % (time()-t0,))
#Time for merging the data: 0.14

t0=time()
zratings = bcolz.ctable.fromdataframe(ratings)
zusers = bcolz.ctable.fromdataframe(users)
print ("Time for ctable conversion: %.2f" % (time()-t0,))
#Time for ctable conversion: 0.05

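# Concatenate the field dtypes of both tables in column order
# (looked up by field name, so the result does not depend on dict ordering)
new_dtypes = ','.join([zusers.dtype[n].str for n in zusers.dtype.names] +
                      [zratings.dtype[n].str for n in zratings.dtype.names])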

#Do the merge with a list stored intermediately
t0 = time()
merged = list(joinz(0,zusers.iter(),0,zratings.iter()))
zuser_zrating1 = bcolz.fromiter(((a[0]+a[1]) for a in merged), dtype = new_dtypes, count = len(merged))
print ("Time for intermediate list bcolz merge: %.2f" % (time()-t0,))
#Time for intermediate list bcolz merge: 3.16

# Do the merge ONLY using iterators to limit memory consumption
t0 = time()
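# First pass: count the joined rows without storing them
join_count = sum(1 for _ in joinz(0, zusers.iter(), 0, zratings.iter()))
# Second pass: stream the joined rows straight into the new ctable
zuser_zrating2 = bcolz.fromiter(
    (a[0] + a[1] for a in joinz(0, zusers.iter(), 0, zratings.iter())),
    dtype=new_dtypes, count=join_count)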
print ("Time for 2x iters of merged bcolz: %.2f" % (time()-t0,))
#Time for 2x iters of merged bcolz: 3.31

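As you can see, the version I put together is over 20 times slower than pandas going by the timings above (about 3.2 s versus 0.14 s). On the other hand, using only iterators saves a lot of memory. Feel free to comment and/or expand on this. bcolz looks like a great package to build on.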
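One possible refinement, offered as a sketch rather than something I have benchmarked: the iterator-only variant runs the whole join twice just to learn `count`. For an inner join, the number of output rows equals the sum, over every key, of (occurrences on the left) times (occurrences on the right), so it can be computed from the key columns alone. `inner_join_count` is a hypothetical helper name and the sample keys are made up.

```python
from collections import Counter


def inner_join_count(left_keys, right_keys):
    """Return the number of rows an inner join on these keys would produce.

    Hypothetical helper for illustration only.
    """
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # A key seen n times on the left and m times on the right yields n * m rows.
    # Counter returns 0 for keys it has never seen.
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Made-up sample keys: user 3 has no ratings, user 4 has no user record.
user_ids = [1, 2, 3]
rating_user_ids = [1, 1, 2, 4]

n_rows = inner_join_count(user_ids, rating_user_ids)
print(n_rows)
```

With the ctables from the example, the two arguments would be the join-key columns of each table. This only holds the distinct keys and their counts in memory, not the joined rows.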
