市场篮子分析
我有一个关于零售店的交易数据集,使用的是pandas工具:
print(df)
product Date Assistant_name
product_1 2017-01-02 11:45:00 John
product_2 2017-01-02 11:45:00 John
product_3 2017-01-02 11:55:00 Mark
...
我想要创建一个新的数据集,用于市场购物篮分析:
product Date Assistant_name Invoice_number
product_1 2017-01-02 11:45:00 John 1
product_2 2017-01-02 11:45:00 John 1
product_3 2017-01-02 11:55:00 Mark 2
...
简单来说,如果一笔交易有相同的助手名字和日期,我就认为这会生成一张新的发票。
相关问题:
- 暂无相关问题
2 个回答
1
使用 pandas
的分类功能是一种方法:
df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
# product Date Assistant_name Invoice
# product_1 2017-01-02 11:45:00 John 1
# product_2 2017-01-02 11:45:00 John 1
# product_3 2017-01-02 11:55:00 Mark 2
这种方法的好处是,你可以很方便地获取一个映射的字典:
dict(enumerate(df['Invoice'].astype('category').cat.categories, 1))
# {1: ('11:45:00', 'John'), 2: ('11:55:00', 'Mark')}
4
最简单的方法是使用factorize
,把多个列合在一起:
df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
print (df)
product Date Assistant_name Invoice
0 product_1 2017-01-02 11:45:00 John 1
1 product_2 2017-01-02 11:45:00 John 1
2 product_3 2017-01-02 11:55:00 Mark 2
如果性能很重要,可以使用pd.lib.fast_zip
:
df['Invoice']=pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0]+1
时间测试:
#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [178]: %%timeit
...: df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
...: df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
...:
9.16 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: %%timeit
...: df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
...:
11.2 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [180]: %%timeit
...: df['Invoice'] = pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0] + 1
...:
6.27 ms ± 93.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)