How do I count the unique values across all columns and show them, with their names, in a separate DataFrame?

Posted 2024-06-02 07:25:32


| 1st Most Common Value | 2nd Most Common Value | 3rd Most Common Value | 4th Most Common Value | 5th Most Common Value |
|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
| Grocery Store         | Pub                   | Coffee Shop           | Clothing Store        | Park                  |
| Pub                   | Grocery Store         | Clothing Store        | Park                  | Coffee Shop           |
| Hotel                 | Theatre               | Bookstore             | Plaza                 | Park                  |
| Supermarket           | Coffee Shop           | Pub                   | Park                  | Cafe                  |
| Pub                   | Supermarket           | Coffee Shop           | Cafe                  | Park                  |

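The DataFrame is named df0. As you can see, many values repeat across the columns, so I want to create a DataFrame that contains the unique values from all columns together with their frequencies. Can anyone help me with the code? I want to draw a bar chart from the result.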

The output should look like this:

| Venues         | Count |
|----------------|-------|
| Bookstore      | 1     |
| Cafe           | 2     |
| Coffee Shop    | 4     |
| Clothing Store | 2     |
| Grocery Store  | 2     |
| Hotel          | 1     |
| Park           | 5     |
| Plaza          | 1     |
| Pub            | 4     |
| Supermarket    | 2     |
| Theatre        | 1     |

3 answers

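Edit: I got ahead of myself in my original answer (and thanks to the OP for adding the edit with the expected output). What you want is covered in this post, and I think the simplest answer is: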

new_df = pd.DataFrame(df0.stack().value_counts())

If you don't care which column a value came from and only need the counts, use value_counts() as described in this post (as @Celius Stingher said in the comments).
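As a minimal sketch of how the one-liner above could be shaped into the two-column layout the question asks for: the sample data below is copied from the question's table, while the variable name `counts` and the alphabetical sort are assumptions made for illustration.

```python
import pandas as pd

# Sample data copied from the table in the question.
df0 = pd.DataFrame({
    "1st Most Common Value": ["Grocery Store", "Pub", "Hotel", "Supermarket", "Pub"],
    "2nd Most Common Value": ["Pub", "Grocery Store", "Theatre", "Coffee Shop", "Supermarket"],
    "3rd Most Common Value": ["Coffee Shop", "Clothing Store", "Bookstore", "Pub", "Coffee Shop"],
    "4th Most Common Value": ["Clothing Store", "Park", "Plaza", "Park", "Cafe"],
    "5th Most Common Value": ["Park", "Coffee Shop", "Park", "Cafe", "Park"],
})

counts = (
    df0.stack()                    # flatten every column into a single Series
       .value_counts()             # count each distinct value
       .rename_axis("Venues")      # name the index so it becomes a column
       .reset_index(name="Count")  # Series -> two-column DataFrame
       .sort_values("Venues", ignore_index=True)  # alphabetical order (assumed preference)
)
print(counts)
```

The resulting `Venues` and `Count` columns should be usable directly as the labels and heights of a bar chart.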

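If you do want to report the frequency of each value per column, you can call value_counts() on each column, but the results may have unequal lengths (to get back to a DataFrame, you could do some kind of join).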

Instead, I wrote a small function that counts how often each value occurs in the df and returns a new DataFrame:

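import pandas as pd
import numpy as np

def counted_entries(df, array):
    # One row per value in `array`, one column per column of `df`.
    output = pd.DataFrame(columns=df.columns, index=array)
    for i in array:
        # (df == i) is a boolean frame; summing it counts matches per column.
        output.loc[i] = (df == i).sum()
    return output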

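Here it is applied to a df filled with random animal names. You only need to pass in the unique entries of the df, which you can get by taking the set of its values: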

columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(5)]

df = pd.DataFrame(np.random.choice(['pig','cow','sheep','horse','dog'],size=(5,10)), columns=columns, index=index)

unique_vals = list(set(df.stack())) #this is all the possible entries in the df

df2 = counted_entries(df, unique_vals)

The df before:

      Column 1 Column 2 Column 3 Column 4  ... Column 7 Column 8 Column 9 Column 10
Row 1      pig      pig      cow      cow  ...      cow      pig      dog       pig
Row 2    sheep      cow      pig    sheep  ...      dog      pig      pig       cow
Row 3      cow      cow      cow    sheep  ...    horse      dog    sheep     sheep
Row 4    sheep      cow    sheep      cow  ...      cow    horse      pig       pig
Row 5      dog      pig    sheep    sheep  ...    sheep    sheep    horse     horse

Output of counted_entries():

       Column 1  Column 2  Column 3  ...  Column 8  Column 9  Column 10
pig           1         2         1  ...         2         2          2
horse         0         0         0  ...         1         1          1
sheep         2         0         2  ...         1         1          1
dog           1         0         0  ...         1         1          0
cow           1         3         2  ...         0         0          1
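For comparison, a similar per-column table can probably be built without an explicit loop by applying value_counts to every column. This is only a sketch: the two columns below are copied from the first two columns of the example df above, and the name `per_column` is made up for illustration. Unlike counted_entries(), values that occur in no column at all (such as horse here) do not get a row.

```python
import pandas as pd

# First two columns of the example df shown above.
df = pd.DataFrame({
    "Column 1": ["pig", "sheep", "cow", "sheep", "dog"],
    "Column 2": ["pig", "cow", "cow", "cow", "pig"],
})

per_column = (
    df.apply(pd.Series.value_counts)  # value_counts per column, aligned on the values
      .fillna(0)                      # a value missing from a column counts as 0
      .astype(int)                    # NaN forced floats; convert back to integers
)
print(per_column)
```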

Thanks for the edit. Maybe this is what you want: apply value_counts to the whole DataFrame, then aggregate the output:

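import pandas as pd

df0 = pd.DataFrame({'1st':['Grocery','Pub','Hotel','Supermarket','Pub'],
                    '2nd':['Pub','Grocery','Theatre','Coffee','Supermarket'],
                    '3rd':['Coffee','Clothing','Supermarket','Pub','Coffee'],
                    '4th':['Clothing','Park','Plaza','Park','Cafe'],
                    '5th':['Park','Coffee','Park','Cafe','Park']})

# Per-column counts; values missing from a column become NaN.
df1 = df0.apply(pd.Series.value_counts)
# Row-wise sum skips NaN, so Count is the total across all columns (as floats).
df1['Count'] = df1.sum(axis=1)
df1 = df1.rename_axis('Venues').reset_index().drop(columns=list(df0))
# Sort by frequency; mergesort is stable, so ties keep their original order.
df1 = df1.sort_values('Count', ascending=False, kind='mergesort')
print(df1)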

Output:

        Venues  Count
5         Park    5.0
2       Coffee    4.0
7          Pub    4.0
8  Supermarket    3.0
0         Cafe    2.0
1     Clothing    2.0
3      Grocery    2.0
4        Hotel    1.0
6        Plaza    1.0
9      Theatre    1.0

You can also do it this way:

import pandas as pd

df = pd.read_csv('test.csv', sep=',')
list_of_list = df.values.tolist()
# Flatten the list of rows into a single list of values.
t_list = sum(list_of_list, [])
df = pd.DataFrame(t_list)
df.columns = ['Columns']
# size() counts the rows in each group; reset_index(name=...) names the count column.
df = df.groupby('Columns').size().reset_index(name='Count')
print(df)

           Columns  Count
0        Bookstore      1
1             Cafe      2
2   Clothing Store      2
3      Coffee Shop      4
4    Grocery Store      2
5            Hotel      1
6             Park      5
7            Plaza      1
8              Pub      4
9      Supermarket      2
10         Theatre      1
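A note on the flattening step: sum(list_of_list, []) builds a new list on every addition, which can get slow for large inputs. A possible alternative, sketched here with the standard library's itertools.chain and collections.Counter, is shown below. The `rows` list is a made-up stand-in for df.values.tolist(), and the names `tally` and `result` are illustrative only.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical stand-in for df.values.tolist(): one inner list per row.
rows = [
    ["Pub", "Park", "Cafe"],
    ["Park", "Pub", "Park"],
]

# chain.from_iterable walks the rows lazily, without building intermediate lists.
tally = Counter(chain.from_iterable(rows))

# Sort by value name and build the same two-column layout as above.
result = pd.DataFrame(sorted(tally.items()), columns=["Columns", "Count"])
print(result)
```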
