检测数据帧中某些列中的重复项并在这些列上执行操作

2024-04-20 15:21:58 发布

您现在位置:Python中文网/ 问答频道 /正文

下面是我想要的输入/输出。我有一些想法,但不完全确定。。你知道吗

How do i detect duplicates and then among them cross check if two columns have similar values?

所以我有一个这样的数据框。你知道吗

 No  fname        sname        landline        address     time_of_move_in
 1   Alphred      Thomas         123              A        19/10/2016,00:01:00
 2   Peter        Jay            345              B        29/10/2016,00:01:00
 3   Donald       Hook           123              A        30/10/2016,00:11:00
 4   Jay          Donald         345              B        29/10/2016,00:05:00
 5   Jay          Donald         123              A        30/10/2016,00:14:00
 6   Haskell      Peter          123              B        19/10/2016,00:01:00

我想要的是这样的输出

 Case_Number   fname    sname    landline   address   time_diff
      1        Peter     Jay       345         B       -4 Hours
      1        Jay       Donald    345         B       4 Hours
      2        Donald    Hook      123         A       -2 Hours
      2        Jay       Donald    123         A       2 Hours

最后,我只想过滤掉发现的两个时间差为<;3小时的任何情况。你知道吗

任何两个病例之间的标准

  1. 固定电话和地址应相同

  2. 如果上述为真,则在检测到的两行之间的fname或姓氏中必须重复相同的名称。(在上面的案例1中是Jay,在上面的案例2中是Donald。注意:如果Donald在fname中重复两次,则这是无效的情况)

  3. 两个小时之间的时差<;3小时,我想把时间的方向性也带到这里,因此在上面的输出中设置了负片。

注:我们不必以上述格式显示时差。只要是数字/时间格式就行


Tags: lttimeaddress情况hookfname案例peter
1条回答
网友
1楼 · 发布于 2024-04-20 15:21:58

您可以将timedelta转换为^{},因为使用timedelta < 0有点复杂:

df.time_of_move_in = pd.to_datetime(df.time_of_move_in, format='%d/%m/%Y,%H:%M:%S')
print (df)
   No    fname   sname  landline address     time_of_move_in
0   1  Alphred  Thomas       123       A 2016-10-19 00:01:00
1   2    Peter     Jay       345       B 2016-10-29 00:01:00
2   3   Donald    Hook       123       A 2016-10-30 00:11:00
3   4      Jay  Donald       345       B 2016-10-29 00:05:00
4   5      Jay  Donald       123       A 2016-10-30 00:14:00
5   6  Haskell   Peter       123       B 2016-10-19 00:01:00

def f(x):
    #convert 4 hours to seconds  
    hours4 = 4 * 60 * 60
    mask = x.fname.isin(x.sname) | x.sname.isin(x.fname) & (len(x) > 1)
    x1 = x[mask]
    #create unique values from x.name, insert as first column
    x1.insert(0,'Case_number', '{}{}'.format(*x.name))
    #get difference of datetimes, first value is NaN
    x1['time_diff'] = x1.time_of_move_in.diff().dt.total_seconds() 
    #get inverse difference, last value is NaN so filna NaN by value
    x1['time_diff']=x1['time_diff'].fillna(x1.time_of_move_in.diff(-1).dt.total_seconds())
    #boolean indexing
    x1 = x1[(x1['time_diff'] < hours4) & (x1['time_diff']  > -hours4)]
    return x1


df2 = df.groupby(['landline','address']).apply(f).reset_index(drop=True)
#factorize values, add 1 for start from 1
df2.Case_number = pd.factorize(df2.Case_number)[0] + 1
df2.drop(['time_of_move_in', 'No'], axis=1, inplace=True)
print (df2)
   Case_number   fname   sname  landline address  time_diff
0            1  Donald    Hook       123       A     -180.0
1            1     Jay  Donald       123       A      180.0
2            2   Peter     Jay       345       B     -240.0
3            2     Jay  Donald       345       B      240.0

相关问题 更多 >