<p>We can apply these formulae with <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby.apply</code></a>:</p>
<ul>
<li>either return a <code>Volume</code>/<code>Cost</code> dataframe per group</li>
<li>or return a series of <code>Cost</code> tuples and <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" rel="nofollow noreferrer"><code>explode</code></a> the tuples</li>
</ul>
<hr/>
<h3>DataFrame option</h3>
<ol>
<li><p>First convert the numeric strings to actual numbers (or, if you are loading the data with <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html" rel="nofollow noreferrer"><code>read_csv</code></a>, use the <code>thousands</code> param):</p>
<pre><code>df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
</code></pre>
</li>
<li><p>Given a <code>Group</code>/<code>Month</code>/<code>ID</code> group, return its <code>Volume</code> and <code>Cost</code> as a dataframe:</p>
<pre><code>def formulae_df(g):
    # set index to Cost Type for simpler indexing
    g = g.set_index('Cost Type')
    # initialize empty result df
    df = pd.DataFrame(columns=['Volume', 'Cost'], index=['Freight', 'FOB', 'Price']).rename_axis('Cost Type')
    # fill result df with formulae
    df['Volume'] = g.loc['FOB', 'Volume']
    df.loc['Freight', 'Cost'] = abs(g.loc['Customer Backhaul', 'Order Cost']) + g.loc['Vendor Freight - Delivered', 'Order Cost']
    df.loc['FOB', 'Cost'] = g.loc['FOB', 'Order Cost']
    df.loc['Price', 'Cost'] = g.loc['Price', 'Order Cost'] - g.loc['Customer Backhaul', 'Order Cost']
    return df
</code></pre>
</li>
<li><p>Then apply <code>formulae_df</code> with <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby.apply</code></a>:</p>
<pre><code>df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_df).reset_index()
#   Group     Month     ID Cost Type  Volume      Cost
# 0     A  1/1/2021  SKU_1   Freight   75357    116570
# 1     A  1/1/2021  SKU_1       FOB   75357  12407112
# 2     A  1/1/2021  SKU_1     Price   75357  12458212
# 3     B  1/1/2021  SKU_1   Freight  931866   1378414
# 4     B  1/1/2021  SKU_1       FOB  931866  50059515
# 5     B  1/1/2021  SKU_1     Price  931866  62490987
</code></pre>
</li>
</ol>
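<p>The <code>thousands</code> route mentioned in step 1 can be sketched as follows. This is a minimal illustration, not the original loading code: the pipe-delimited sample text and the <code>parse_dollars</code> helper are assumptions made up for the example. <code>thousands=','</code> handles <code>Volume</code>, while the <code>$</code> prefix in <code>Order Cost</code> still needs its own converter.</p>

```python
import io

import pandas as pd

# Hypothetical pipe-delimited sample; the real file layout is not shown in the post.
raw = io.StringIO(
    "Cost Type|Volume|Order Cost\n"
    "Customer Backhaul|34,848|$-51,100\n"
    "FOB|75,357|$12,407,112\n"
)


def parse_dollars(text):
    """Hypothetical helper: strip '$' and ',' and return an int."""
    return int(text.replace('$', '').replace(',', ''))


# thousands=',' parses Volume as integers; the converter handles Order Cost.
sample = pd.read_csv(raw, sep='|', thousands=',', converters={'Order Cost': parse_dollars})
print(sample)
```

<p>With the columns parsed at load time, the <code>str.replace</code> step would no longer be needed.</p>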
<hr/>
<h3>With <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" rel="nofollow noreferrer"><code>explode</code></a></h3>
<p>Since each group has one <code>Volume</code> and multiple <code>Cost</code> values, we can generate the <code>Cost</code> values as lists/tuples and <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" rel="nofollow noreferrer"><code>explode</code></a> them:</p>
<ol>
<li><p>The first step is still to convert the numeric strings to actual numbers:</p>
<pre><code>df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
</code></pre>
</li>
<li><p>Given a <code>Group</code>/<code>Month</code>/<code>ID</code> group, compute its <code>Volume</code> (a value) and <code>Cost</code> (a tuple):</p>
<pre><code>def formulae_series(g):
    # set index for easy loc access
    g = g.set_index('Cost Type')
    # compute formulae
    volume = g.loc['FOB', 'Volume']
    costs = {
        'Freight': abs(g.loc['Customer Backhaul', 'Order Cost']) + g.loc['Vendor Freight - Delivered', 'Order Cost'],
        'FOB': g.loc['FOB', 'Order Cost'],
        'Price': g.loc['Price', 'Order Cost'] - g.loc['Customer Backhaul', 'Order Cost'],
    }
    # return volume as a value and costs as tuples
    return pd.Series({'Cost Type': tuple(costs.keys()), 'Volume': volume, 'Cost': tuple(costs.values())})
</code></pre>
</li>
<li><p>When we apply <code>formulae_series</code> with <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby.apply</code></a>, note how the <code>Cost Type</code> and <code>Cost</code> columns contain tuples:</p>
<pre><code>df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_series)
#                                   Cost Type  Volume                           Cost
# Group Month    ID
# A     1/1/2021 SKU_1  (Freight, FOB, Price)   75357   (116570, 12407112, 12458212)
# B     1/1/2021 SKU_1  (Freight, FOB, Price)  931866  (1378414, 50059515, 62490987)
</code></pre>
</li>
<li><p>Now <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" rel="nofollow noreferrer"><code>explode</code></a> those tuples into rows (exploding multiple columns at once requires pandas 1.3.0 or later):</p>
<pre><code>df_2 = df_2.explode(['Cost Type', 'Cost']).reset_index()
#   Group     Month     ID Cost Type  Volume      Cost
# 0     A  1/1/2021  SKU_1   Freight   75357    116570
# 1     A  1/1/2021  SKU_1       FOB   75357  12407112
# 2     A  1/1/2021  SKU_1     Price   75357  12458212
# 3     B  1/1/2021  SKU_1   Freight  931866   1378414
# 4     B  1/1/2021  SKU_1       FOB  931866  50059515
# 5     B  1/1/2021  SKU_1     Price  931866  62490987
</code></pre>
</li>
</ol>
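<p>To see what the multi-column <code>explode</code> does in isolation, here is a small sketch on a toy frame. The <code>toy</code> data and the <code>long_df</code> name are invented for illustration and are not part of the original data; the only requirement is that the tuples in the exploded columns have matching lengths within each row.</p>

```python
import pandas as pd

# Toy frame (hypothetical values): one Volume per row, tuples in the Cost* columns.
toy = pd.DataFrame({
    'Group': ['A', 'B'],
    'Cost Type': [('Freight', 'FOB'), ('Freight', 'FOB')],
    'Volume': [10, 20],
    'Cost': [(1, 2), (3, 4)],
})

# Explode both tuple columns in lockstep; scalar columns such as Volume are repeated.
long_df = toy.explode(['Cost Type', 'Cost']).reset_index(drop=True)
print(long_df)
```

<p>Each input row becomes one output row per tuple element, which is why <code>Volume</code> ends up repeated for every cost type.</p>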
<hr/>
<h3>完整代码</h3>
<p>Here are the steps put back together (including an optional conversion back to comma/dollar strings):</p>
<pre><code>import pandas as pd

## load df_1
df_1 = pd.DataFrame([
    ['A', '1/1/2021', 'SKU_1', 'Customer Backhaul', '34,848', '$-51,100'],
    ['A', '1/1/2021', 'SKU_1', 'FOB', '75,357', '$12,407,112'],
    ['A', '1/1/2021', 'SKU_1', 'Price', '75,357', '$12,407,112'],
    ['A', '1/1/2021', 'SKU_1', 'Vendor Freight - Delivered', '40,511', '$65,470'],
    ['B', '1/1/2021', 'SKU_1', 'Customer Backhaul', '197,904', '$-157,487'],
    ['B', '1/1/2021', 'SKU_1', 'FOB', '931,866', '$50,059,515'],
    ['B', '1/1/2021', 'SKU_1', 'Price', '931,866', '$62,333,500'],
    ['B', '1/1/2021', 'SKU_1', 'Vendor Freight - Delivered', '740,355', '$1,220,927'],
], columns=['Group', 'Month', 'ID', 'Cost Type', 'Volume', 'Order Cost'])
## convert to numerics
df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
## dataframe option (formulae_df as defined above)
df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_df).reset_index()
## or apply formulae and explode costs
# df_2 = (df_1.groupby(['Group', 'Month', 'ID'])
# .apply(formulae_series)
# .explode(['Cost Type', 'Cost'])
# .reset_index())
## optional: revert to comma/dollar strings
df_2['Volume'] = df_2['Volume'].map('{:,}'.format)
df_2['Cost'] = df_2['Cost'].map('${:,}'.format)
</code></pre>
<p>Output:</p>
<pre class="lang-none prettyprint-override"><code>  Group     Month     ID Cost Type   Volume         Cost
0     A  1/1/2021  SKU_1   Freight   75,357     $116,570
1     A  1/1/2021  SKU_1       FOB   75,357  $12,407,112
2     A  1/1/2021  SKU_1     Price   75,357  $12,458,212
3     B  1/1/2021  SKU_1   Freight  931,866   $1,378,414
4     B  1/1/2021  SKU_1       FOB  931,866  $50,059,515
5     B  1/1/2021  SKU_1     Price  931,866  $62,490,987
</code></pre>
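<p>As a sanity check on the output above, the arithmetic can be reproduced without pandas. This is only a sketch: the <code>orders</code> dict and the <code>compute_costs</code> helper are hypothetical names, and the formulae are the ones encoded in <code>formulae_df</code>, using the <code>Order Cost</code> values from <code>df_1</code>.</p>

```python
# Order Cost values per group, copied from df_1 after numeric conversion.
orders = {
    'A': {'Customer Backhaul': -51100, 'FOB': 12407112,
          'Price': 12407112, 'Vendor Freight - Delivered': 65470},
    'B': {'Customer Backhaul': -157487, 'FOB': 50059515,
          'Price': 62333500, 'Vendor Freight - Delivered': 1220927},
}


def compute_costs(order):
    """Hypothetical helper mirroring the formulae used in formulae_df."""
    backhaul = order['Customer Backhaul']
    return {
        'Freight': abs(backhaul) + order['Vendor Freight - Delivered'],
        'FOB': order['FOB'],
        'Price': order['Price'] - backhaul,
    }


results = {group: compute_costs(order) for group, order in orders.items()}
for group, costs in results.items():
    print(group, costs)
```

<p>Because <code>Customer Backhaul</code> is negative, subtracting it raises <code>Price</code>, which matches rows 2 and 5 of the output.</p>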