Answers to the question: Create a new df based on columns in the current df

Create a new df based on columns in the current df


1 answer

Anonymous, 1 day ago

<p>We can apply these formulae with <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby.apply</code></a>:</p>
<ul>
<li>Either return a <code>Volume</code>/<code>Cost</code> dataframe per group</li>
<li>Or return a series of <code>Cost</code> tuples and <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" rel="nofollow noreferrer"><code>explode</code></a> the tuples</li>
</ul>
<hr/>
<h3>Dataframe option</h3>
<ol>
<li><p>First convert the numeric strings to actual numbers (or, if you are loading the data with <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html" rel="nofollow noreferrer"><code>read_csv</code></a>, use the <code>thousands</code> parameter):</p>
<pre><code>df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
</code></pre>
</li>
<li><p>Given a <code>Group</code>/<code>Month</code>/<code>ID</code> group, return its <code>Volume</code> and <code>Cost</code> as a dataframe:</p>
<pre><code>def formulae_df(g):
    # set index to Cost Type for simpler indexing
    g = g.set_index('Cost Type')

    # initialize empty result df
    df = pd.DataFrame(columns=['Volume', 'Cost'], index=['Freight', 'FOB', 'Price']).rename_axis('Cost Type')

    # fill result df with formulae
    df['Volume'] = g.loc['FOB', 'Volume']
    df.loc['Freight', 'Cost'] = abs(g.loc['Customer Backhaul', 'Order Cost']) + g.loc['Vendor Freight - Delivered', 'Order Cost']
    df.loc['FOB', 'Cost'] = g.loc['FOB', 'Order Cost']
    df.loc['Price', 'Cost'] = g.loc['Price', 'Order Cost'] - g.loc['Customer Backhaul', 'Order Cost']

    return df
</code></pre>
</li>
<li><p>Then apply <code>formulae_df</code> with <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby.apply</code></a>:</p>
<pre><code>df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_df).reset_index()

#   Group     Month     ID Cost Type  Volume      Cost
# 0     A  1/1/2021  SKU_1   Freight   75357    116570
# 1     A  1/1/2021  SKU_1       FOB   75357  12407112
# 2     A  1/1/2021  SKU_1     Price   75357  12458212
# 3     B  1/1/2021  SKU_1   Freight  931866   1378414
# 4     B  1/1/2021  SKU_1       FOB  931866  50059515
# 5     B  1/1/2021  SKU_1     Price  931866  62490987
</code></pre>
</li>
</ol>
<hr/>
<h3>With <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" rel="nofollow noreferrer"><code>explode</code></a></h3>
<p>Since each group has one <code>Volume</code> and several <code>Cost</code> values, we can generate the <code>Cost</code> values as lists/tuples and <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" rel="nofollow noreferrer"><code>explode</code></a> them (exploding several columns at once requires pandas 1.3 or later):</p>
<ol>
<li><p>The first step is still to convert the numeric strings to actual numbers:</p>
<pre><code>df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
</code></pre>
</li>
<li><p>Given a <code>Group</code>/<code>Month</code>/<code>ID</code> group, compute its <code>Volume</code> (a value) and <code>Cost</code> (a tuple):</p>
<pre><code>def formulae_series(g):
    # set index for easy loc access
    g = g.set_index('Cost Type')

    # compute formulae
    volume = g.loc['FOB', 'Volume']
    costs = {
        'Freight': abs(g.loc['Customer Backhaul', 'Order Cost']) + g.loc['Vendor Freight - Delivered', 'Order Cost'],
        'FOB': g.loc['FOB', 'Order Cost'],
        'Price': g.loc['Price', 'Order Cost'] - g.loc['Customer Backhaul', 'Order Cost'],
    }

    # return volume as a value and costs as explicit tuples
    return pd.Series({'Cost Type': tuple(costs.keys()), 'Volume': volume, 'Cost': tuple(costs.values())})
</code></pre>
</li>
<li><p>When we apply <code>formulae_series</code> with <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby.apply</code></a>, note how the <code>Cost*</code> columns contain tuples:</p>
<pre><code>df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_series)

#                                   Cost Type  Volume                           Cost
# Group Month    ID
# A     1/1/2021 SKU_1  (Freight, FOB, Price)   75357   (116570, 12407112, 12458212)
# B     1/1/2021 SKU_1  (Freight, FOB, Price)  931866  (1378414, 50059515, 62490987)
</code></pre>
</li>
<li><p>Now <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" rel="nofollow noreferrer"><code>explode</code></a> splits these tuples into rows:</p>
<pre><code>df_2 = df_2.explode(['Cost Type', 'Cost']).reset_index()

#   Group     Month     ID Cost Type  Volume      Cost
# 0     A  1/1/2021  SKU_1   Freight   75357    116570
# 1     A  1/1/2021  SKU_1       FOB   75357  12407112
# 2     A  1/1/2021  SKU_1     Price   75357  12458212
# 3     B  1/1/2021  SKU_1   Freight  931866   1378414
# 4     B  1/1/2021  SKU_1       FOB  931866  50059515
# 5     B  1/1/2021  SKU_1     Price  931866  62490987
</code></pre>
</li>
</ol>
<hr/>
<h3>Full code</h3>
<p>Here are the steps put back together (including an optional conversion back to comma/dollar strings):</p>
<pre><code>import pandas as pd

# formulae_df and formulae_series are the functions defined in the steps above

## load df_1
df_1 = pd.DataFrame([
    ['A','1/1/2021','SKU_1','Customer Backhaul','34,848','$-51,100'],
    ['A','1/1/2021','SKU_1','FOB','75,357','$12,407,112'],
    ['A','1/1/2021','SKU_1','Price','75,357','$12,407,112'],
    ['A','1/1/2021','SKU_1','Vendor Freight - Delivered','40,511','$65,470'],
    ['B','1/1/2021','SKU_1','Customer Backhaul','197,904','$-157,487'],
    ['B','1/1/2021','SKU_1','FOB','931,866','$50,059,515'],
    ['B','1/1/2021','SKU_1','Price','931,866','$62,333,500'],
    ['B','1/1/2021','SKU_1','Vendor Freight - Delivered','740,355','$1,220,927'],
], columns=['Group','Month','ID','Cost Type','Volume','Order Cost'])

## convert to numerics
df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)

## dataframe option
df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_df).reset_index()

## or apply formulae and explode costs
# df_2 = (df_1.groupby(['Group', 'Month', 'ID'])
#     .apply(formulae_series)
#     .explode(['Cost Type', 'Cost'])
#     .reset_index())

## optional: revert to comma/dollar strings
df_2['Volume'] = df_2['Volume'].map('{:,}'.format)
df_2['Cost'] = df_2['Cost'].map('${:,}'.format)
</code></pre>
<p>Output:</p>
<pre class="lang-none prettyprint-override"><code>  Group     Month     ID Cost Type   Volume         Cost
0     A  1/1/2021  SKU_1   Freight   75,357     $116,570
1     A  1/1/2021  SKU_1       FOB   75,357  $12,407,112
2     A  1/1/2021  SKU_1     Price   75,357  $12,458,212
3     B  1/1/2021  SKU_1   Freight  931,866   $1,378,414
4     B  1/1/2021  SKU_1       FOB  931,866  $50,059,515
5     B  1/1/2021  SKU_1     Price  931,866  $62,490,987
</code></pre>
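<p>The answer mentions the <code>thousands</code> parameter of <code>read_csv</code> without showing it. The following is a minimal sketch of that route, assuming the data arrives as CSV text with quoted numeric fields. The sample text <code>csv_text</code> and the helper <code>parse_dollars</code> are hypothetical names introduced only for this illustration. Because <code>thousands</code> does not strip the dollar sign, the sketch handles <code>Order Cost</code> with a converter instead:</p>

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file; rows mirror group A of df_1.
csv_text = '''Group,Month,ID,Cost Type,Volume,Order Cost
A,1/1/2021,SKU_1,Customer Backhaul,"34,848","$-51,100"
A,1/1/2021,SKU_1,FOB,"75,357","$12,407,112"
A,1/1/2021,SKU_1,Price,"75,357","$12,407,112"
A,1/1/2021,SKU_1,Vendor Freight - Delivered,"40,511","$65,470"
'''


def parse_dollars(text):
    """Hypothetical helper: turn a string like '$-51,100' into the int -51100."""
    return int(text.replace('$', '').replace(',', ''))


df_1 = pd.read_csv(
    io.StringIO(csv_text),
    thousands=',',                             # parses "34,848" as the number 34848
    converters={'Order Cost': parse_dollars},  # strips '$' and ',' per cell
)

print(df_1[['Cost Type', 'Volume', 'Order Cost']])
```

<p>Loaded this way, the two <code>str.replace</code> conversion lines should no longer be needed before calling <code>groupby.apply</code>.</p>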
