I have a dataset defined as follows:
eno|date|attendance
1|01-Jan-2010|P
1|02-Jan-2010|P
1|03-Jan-2010|A
1|04-Jan-2010|P
1|05-Jan-2010|P
2|01-Jan-2010|P
2|02-Jan-2010|P
2|03-Jan-2010|P
2|04-Jan-2010|A
2|05-Jan-2010|P
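For each employee, I need to create an "interval group" that groups the attendance values in chronological order. A group collects consecutive identical attendance values and ends as soon as a different value appears. The expected output is therefore: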
eno|date|attendance|attendanceGroup
1|01-Jan-2010|P|1
1|02-Jan-2010|P|1
1|03-Jan-2010|A|2
1|04-Jan-2010|P|3
1|05-Jan-2010|P|3
2|01-Jan-2010|P|1
2|02-Jan-2010|P|1
2|03-Jan-2010|P|1
2|04-Jan-2010|A|2
2|05-Jan-2010|P|3
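To make the grouping rule concrete, here is a minimal plain-Python sketch (not Spark) that numbers consecutive runs of equal values for one employee whose rows are already sorted by date. The function name `attendance_groups` and the two sample lists are illustrative assumptions, not part of the original code.

```python
from itertools import groupby


def attendance_groups(values):
    """Number consecutive runs of equal values, starting at 1.

    `values` must already be in chronological order for one employee.
    (Hypothetical helper, used only to illustrate the rule.)
    """
    groups = []
    # groupby() yields one (key, run) pair per run of consecutive equal values.
    for group_no, (_, run) in enumerate(groupby(values), start=1):
        groups.extend(group_no for _ in run)
    return groups


# Attendance columns of employees 1 and 2 from the table above.
emp1 = ["P", "P", "A", "P", "P"]
emp2 = ["P", "P", "P", "A", "P"]

print(attendance_groups(emp1))
print(attendance_groups(emp2))
```

This only pins down what the output should look like; it does not scale the way a window-function solution in Spark would.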
So far, all I have managed to do is fetch the previous row's attendance value, and I have no idea how to proceed from there. Thanks in advance.
from datetime import datetime, timedelta
from pyspark.sql import Row, Window
from pyspark.sql.functions import lag
EmployeeAttendance = Row("eno", "date", "attendance")
EmpAttRowList = [EmployeeAttendance("1", datetime.now().date() - timedelta(days=100), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=99), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=98), "N"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=97), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=96), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=95), "N"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=94), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=93), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=100), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=99), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=98), "N"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=97), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=96), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=95), "N"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=94), "N"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=93), "N"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=92), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=91), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=90), "N"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=97), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=96), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=95), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=94), "N"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=93), "N"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=92), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=91), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=90), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=89), "Y")
]
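# `spark` is assumed to be an existing SparkSession (as in the pyspark shell).
df = spark.createDataFrame(EmpAttRowList, EmployeeAttendance)
# Per-employee window, ordered chronologically.
window = Window.partitionBy(df['eno']).orderBy("date")
# Attendance value of the previous row for the same employee.
previousrowattendance = lag(df["attendance"]).over(window)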
Assuming you have already created the DataFrame with the code above, you can use the code below to get attendanceGroup. Let me know whether it works.
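You can try this:
Use the condition attendance != lag(attendance) to create a grp flag, so that the flag can be summed. Then define a new window over the original id eno, ordered by date, and apply a running sum over the newly created grp flag column, adding 1 so that the counter starts from 1.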