在时间戳中计算特定帧内出现的次数

2024-04-27 18:52:13 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个包含伪数据的文件(txt),包含以下3列:

user_13 visit_19    1330760979

user_14 visit_20    1330732782

user_14 visit_21    1330769600

user_14 visit_22    1330783341

user_14 visit_23    1330796012

user_14 visit_24    1330797842

使用熊猫,我如何从第三栏的时间戳中找出16:00-22:00之间访问该页面的用户数量

一个用户可能有多次访问,为了符合列表的要求,第一次和最后一次访问必须在16:00和22:00之间


Tags: 文件数据用户txt列表时间页面visit
2条回答

我在下面发布的代码实现了您的目标。我还上传了完整的Jupyter笔记本here

import pandas as pd
from io import StringIO
import datetime

data = '''
user_13 visit_19    1330760979

user_14 visit_20    1330732782

user_14 visit_21    1330769600

user_14 visit_22    1330783341

user_14 visit_23    1330796012

user_14 visit_24    1330797842
'''

data = data.strip()      # remove trailing white-spaces and newline characters
data = io.StringIO(data) # create in-memory stream from the data string 
                         # variable for use in read_csv

df = pd.read_csv(
     data,                              # the data in "CSV" format
     delim_whitespace=True,             # override default delimiter "," with whitespace
     names=['user','visit','timestamp'] # define column names for the resulting dataframe
)

df['timestamp'] = pd.to_datetime(df['timestamp'], # convert "timestamp" to datetime 
                                 unit='s')        # with up to second precision


lb = 16 # lower hour bound
ub = 22 # upper hour bound

vb = df[
     (lb <= df['timestamp'].dt.hour) # compare hour in "timestamp" column to lb using dt accessor
     &                               # perform logical and series operator on the two resulting series
     (ub >= df['timestamp'].dt.hour) # compare hour in "timestamp" column to lb using dt accessor
]

​print(vb.user.drop_duplicates().value_counts())

user_14    1
Name: user, dtype: int64

正在初始化数据帧:

 df = pd.DataFrame({"user":["user1", "user2", "user3", "user3", "user1", "user1"],
                       "visit":["visit1", "visit2", "visit3", "visit4", "visit5", "visit6"],
                       "timestamp":[1330760979, 1330732782,1330769600, 1330783341, 1330796012,1330797842]})

将时间戳转换为日期时间,在小时之后:

df["datetime"]=pd.to_datetime(df["timestamp"], unit="s")
df["hour"]= df["datetime"].dt.hour

选择所需的daypart范围,然后将所选内容放在Dataframe上:

selected_range = (16,22)
selected_df = df[(df["hour"]>=selected_range[0]) & (df["hour"]<=selected_range[1])]

正在创建要聚合的辅助列:

selected_df["daypart"] = str(selected_range)

聚合用户和访问:

pd.pivot_table(selected_df, values=["user", "visit"], aggfunc={"user":[pd.Series.nunique, "count"], "visit":"count"}, columns="daypart")

输出:非唯一用户计数、唯一用户计数、访问计数

daypart        (16, 22)
user  count           2
      nunique         1
visit count           2

相关问题 更多 >