遍历数据帧以查找str值在元组数组中出现的次数

2024-05-14 19:54:39 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个dataframe,其中一列只包含字符串元组,索引是一个月的某一天和一周的某一天。你知道吗

所以:

    all_app_id  day_of_week dayofmonth
0   (g8m4lecv, uyhsx6lo, kz8udlea, uyhsx6lo, uyhsx...   Fri 29
1   (awkcmpns, yg02r5dz, yg02r5dz, yg02r5dz, t6rf8...   Sat 30
2   (kz8udlea, 6f91a1e2, mv4uo5xy, mv4uo5xy, lfqlz...   Sun 31
3   (ze4rr0vi, t6rf8pr, ze4rr0vi, 3c83655d34c365c9...   Mon 1
4   (dflewd0u, twidi4l3, twidi4l3, ecur4180, e0ff8...   Tues    2
5   (ohafu6x5, kz8udlea, ohafu6x5, jfigr5ko, ohafu...   Weds    3

以及唯一应用程序ID的列表:

array(['g8m4lecv', 'uyhsx6lo', 'kz8udlea', ..., 'nrow38d5', 'dekfx963',
       'ybkne8eo'], dtype=object)

我要计算的基本上是,对于唯一应用程序id列表中的每个app_id,它们出现的次数都按day_of_weekdayofmonth分组,因此对于每个app_id,我可以计算它们最常见的day_of_weekdayofmonth,总的来说,我还可以通过对生成的数据帧进行操作来计算最常见的日期和月份。你知道吗

示例输出(在此之后应该很简单)是:

星期几

           Mon   Tues   Weds  Thurs  Fri 
g8m4lecv   34      53     84      97     20
uyhsx6lo   3423    5443   235     33     54
kz8udlea   121     1212   39      93     756

。。你知道吗

月日

           1       2      3       4      5 ...
g8m4lecv   3       5     8        7     20
uyhsx6lo   12      12    23       233   54
kz8udlea   31      212   39       93     2

。。。你知道吗

我一直在尝试使用类似答案中的groupbyfor app_id in unique_app_idsstr.contains(app_id).sum(),但我似乎无法理解如何构造这个和如何组织我的循环。我也在研究list(zip())如何组合元组,但同样不确定如何在没有分组的情况下按星期几组合。你知道吗


Tags: ofidapp元组weekdaymonfri
1条回答
网友
1楼 · 发布于 2024-05-14 19:54:39
df = pd.DataFrame(
    {'all_app_id': ['g8m4lecv, uyhsx6lo, kz8udlea, uyhsx6lo, uyhsx',
                    'awkcmpns, yg02r5dz, yg02r5dz, yg02r5dz, t6rf8',
                    'kz8udlea, 6f91a1e2, mv4uo5xy, mv4uo5xy, lfqlz',
                    'ze4rr0vi, t6rf8pr, ze4rr0vi, 3c83655d34c365c9',
                    'dflewd0u, twidi4l3, twidi4l3, ecur4180, e0ff8',
                    'ohafu6x5, kz8udlea, ohafu6x5, jfigr5ko, ohafu'],
     'day_of_week': ['Fri', 'Sat', 'Sun', 'Mon', 'Tues', 'Weds'],
     'dayofmonth': [29, 30, 31, 1, 2, 3]})

>>> (pd.melt(
         pd.concat([df.all_app_id.str.split(',', expand=True), 
                    df[['day_of_week', 'dayofmonth']]], 
                   axis=1), 
         id_vars=['day_of_week', 'dayofmonth'], 
         value_name='app_id')
     .drop('variable', axis=1)
     .pivot_table(index='app_id', columns='day_of_week', aggfunc='count')
     .fillna(0)
     )
                  dayofmonth                      
day_of_week              Fri Mon Sat Sun Tues Weds
app_id                                            
 3c83655d34c365c9          0   1   0   0    0    0
 6f91a1e2                  0   0   0   1    0    0
 e0ff8                     0   0   0   0    1    0
 ecur4180                  0   0   0   0    1    0
 jfigr5ko                  0   0   0   0    0    1
 kz8udlea                  1   0   0   0    0    1
 lfqlz                     0   0   0   1    0    0
 mv4uo5xy                  0   0   0   2    0    0
 ohafu                     0   0   0   0    0    1
 ohafu6x5                  0   0   0   0    0    1
 t6rf8                     0   0   1   0    0    0
 t6rf8pr                   0   1   0   0    0    0
 twidi4l3                  0   0   0   0    2    0
 uyhsx                     1   0   0   0    0    0
 uyhsx6lo                  2   0   0   0    0    0
 yg02r5dz                  0   0   3   0    0    0
 ze4rr0vi                  0   1   0   0    0    0
awkcmpns                   0   0   1   0    0    0
dflewd0u                   0   0   0   0    1    0
g8m4lecv                   1   0   0   0    0    0
kz8udlea                   0   0   0   1    0    0
ohafu6x5                   0   0   0   0    0    1
ze4rr0vi                   0   1   0   0    0    0

首先需要将all_app_id列拆分为单独的列:

>>> df.all_app_id.str.split(',', expand=True)
          0          1          2                  3       4
0  g8m4lecv   uyhsx6lo   kz8udlea           uyhsx6lo   uyhsx
1  awkcmpns   yg02r5dz   yg02r5dz           yg02r5dz   t6rf8
2  kz8udlea   6f91a1e2   mv4uo5xy           mv4uo5xy   lfqlz
3  ze4rr0vi    t6rf8pr   ze4rr0vi   3c83655d34c365c9    None
4  dflewd0u   twidi4l3   twidi4l3           ecur4180   e0ff8
5  ohafu6x5   kz8udlea   ohafu6x5           jfigr5ko   ohafu

然后将dayofmonth和day\u of \u week列粘回:

df2 = pd.concat([df.all_app_id.str.split(',', expand=True), 
                 df[['day_of_week', 'dayofmonth']]], axis=1)
>>> df2
          0          1          2                  3       4 day_of_week  dayofmonth
0  g8m4lecv   uyhsx6lo   kz8udlea           uyhsx6lo   uyhsx         Fri          29
1  awkcmpns   yg02r5dz   yg02r5dz           yg02r5dz   t6rf8         Sat          30
2  kz8udlea   6f91a1e2   mv4uo5xy           mv4uo5xy   lfqlz         Sun          31
3  ze4rr0vi    t6rf8pr   ze4rr0vi   3c83655d34c365c9    None         Mon           1
4  dflewd0u   twidi4l3   twidi4l3           ecur4180   e0ff8        Tues           2
5  ohafu6x5   kz8udlea   ohafu6x5           jfigr5ko   ohafu        Weds           3

然后融化结果:

>>> df3 = pd.melt(df2, id_vars=['day_of_week', 'dayofmonth'], value_name='app_id')
>>> df3
   day_of_week  dayofmonth variable             app_id
0          Fri          29        0           g8m4lecv
1          Sat          30        0           awkcmpns
2          Sun          31        0           kz8udlea
...
28        Tues           2        4              e0ff8
29        Weds           3        4              ohafu

删除variable列:

df3 = df3.drop('variable', axis=1)

然后旋转结果:

df4 = df3.pivot_table(index='app_id', columns='day_of_week', aggfunc='count')
>>> df4
                  dayofmonth                      
day_of_week              Fri Mon Sat Sun Tues Weds
app_id                                            
 3c83655d34c365c9        NaN   1 NaN NaN  NaN  NaN
 6f91a1e2                NaN NaN NaN   1  NaN  NaN
 e0ff8                   NaN NaN NaN NaN    1  NaN
 ecur4180                NaN NaN NaN NaN    1  NaN
...

然后用零填充NaN值。你知道吗

df4 = df4.fillna(0)

此时,还可以从顶列中删除dayofmonth

df4.columns = df4.columns.droplevel(0)

瞧!你知道吗

>>> df4.head()
day_of_week        Fri  Mon  Sat  Sun  Tues  Weds
app_id                                           
 3c83655d34c365c9    0    1    0    0     0     0
 6f91a1e2            0    0    0    1     0     0
 e0ff8               0    0    0    0     1     0
 ecur4180            0    0    0    0     1     0
 jfigr5ko            0    0    0    0     0     1

相关问题 更多 >

    热门问题