Finding the number of dataframe rows within a moving time period for a specific category

Posted 2024-06-06 06:38:15


I have two dataframes:

df:

    Name    Date_1  Date_2
0   Alan    2013-06-21  2013-06-26
1   Bob     2011-01-29  2011-02-01
2   Chris   2010-11-15  2010-11-17
3   Bob     2016-03-14  2016-03-16
4   Doug    2011-03-07  2011-03-10
5   Elijah  2011-02-24  2011-03-01
6   Bob     2011-01-03  2011-01-13
7   Bob     2011-02-07  2011-02-25
8   Frank   2014-07-21  2014-07-23
9   Chris   2011-02-18  2011-02-22
10  Doug    2010-09-13  2010-09-17
11  Chris   2011-01-15  2011-01-19
12  George  2010-06-29  2010-06-30

and df1:

        Date        Name    Period
12971   2015-08-18  Alan    2015-08-16
12972   2015-08-19  Alan    2015-08-17
12973   2015-08-20  Alan    2015-08-18
12974   2015-08-21  Alan    2015-08-19
12975   2015-08-22  Alan    2015-08-20
12976   2015-08-23  Alan    2015-08-21
12977   2015-08-24  Alan    2015-08-22
12978   2015-08-25  Alan    2015-08-23
12979   2015-08-26  Alan    2015-08-24
12980   2015-08-27  Alan    2015-08-25
12981   2015-08-28  Alan    2015-08-26
12982   2015-08-29  Alan    2015-08-27

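Both dataframes are thousands of rows long; this is only a sample. For each row of df1, I want to count the rows of df with the same Name where Date_1 is earlier than Period and Date_2 is later than Date.

I did the following, which works but is very slow: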

df1['Volume'] = df1.apply(lambda x: len(df[(df['Name'] == x['Name']) & (
df['Date_1'] < x['Period']) & (df['Date_2'] > x['Date'])]), axis=1)

If you have any suggestions, please let me know.


1 answer
User
#1 · Posted 2024-06-06 06:38:15

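I had to modify the tables you provided slightly, because your current version contains no overlaps. I assume you want to do the following:

The pattern is simple: first outer-join the two tables, then pivot the result, and finally left-join it onto the dataframe of interest. This should be faster than your approach, but it may use more memory.

Parsing the data

The first step is some basic parsing of the data you provided to put it into dataframes (you can probably skip this; I include it so you can follow along):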

import pandas as pd
from io import StringIO
import re

# First set tables
table = """    Name    Date_1  Date_2
0   Alan    2013-06-21  2013-06-26
1   Bob     2011-01-29  2011-02-01
2   Chris   2010-11-15  2010-11-17
3   Bob     2016-03-14  2016-03-16
4   Doug    2011-03-07  2011-03-10
5   Elijah  2011-02-24  2011-03-01
6   Bob     2011-01-03  2011-01-13
7   Bob     2011-02-07  2011-02-25
8   Frank   2014-07-21  2014-07-23
9   Chris   2011-02-18  2011-02-22
10  Doug    2010-09-13  2010-09-17
11  Chris   2011-01-15  2011-01-19
12  George  2010-06-29  2010-06-30"""

table2 = """        Date        Name    Period
12971   2015-08-18  Alan    2015-08-16
12972   2015-08-19  Alan    2015-08-17
12973   2015-08-20  Alan    2015-08-18
12974   2015-08-21  Alan    2015-08-19
12975   2015-08-22  Alan    2015-08-20
12976   2015-08-23  Alan    2015-08-21
12977   2015-08-24  Alan    2015-08-22
12978   2015-08-25  Alan    2015-08-23
12979   2015-08-26  Alan    2015-08-24
12980   2015-08-27  Alan    2015-08-25
12981   2015-08-28  Alan    2015-08-26
12982   2015-08-29  Alan    2015-08-27
12983   2013-06-24  Alan    2013-06-25"""

# Prepare tables in format that makes date lookups easier
series = pd.read_csv(StringIO(table))['    Name    Date_1  Date_2'].apply(lambda x: ["".join(re.findall("[A-Za-z0-9-]",i)) for i in x.split(" ") if re.findall("[A-Za-z0-9-]",i) != []])
df = pd.DataFrame(series.values.tolist(), columns = ["index", "Name", "Date_1","Date_2"])
df["Date_1"] = pd.to_datetime(df["Date_1"])
df["Date_2"] = pd.to_datetime(df["Date_2"])

series = pd.read_csv(StringIO(table2))['        Date        Name    Period'].apply(lambda x: ["".join(re.findall("[A-Za-z0-9-]",i)) for i in x.split(" ") if re.findall("[A-Za-z0-9-]",i) != []])
df1 = pd.DataFrame(series.values.tolist(), columns = ["index", "Date", "Name","Period"])
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Period"] = pd.to_datetime(df1["Period"])
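
If the pasted tables are plain whitespace-separated text, a shorter parse may be possible. The sketch below assumes no field contains spaces and uses a shortened, illustrative sample (the name `sample` is not from the original). Because the header has one fewer field than the data rows, pandas uses the first column as the index, so no separate "index" column is produced as in the code above.

```python
import pandas as pd
from io import StringIO

# Shortened sample in the same layout as the tables above (illustrative only)
sample = """    Name    Date_1  Date_2
0   Alan    2013-06-21  2013-06-26
1   Bob     2011-01-29  2011-02-01
2   Chris   2010-11-15  2010-11-17"""

# Split on runs of whitespace and parse both date columns while reading
parsed = pd.read_csv(StringIO(sample), sep=r"\s+", parse_dates=["Date_1", "Date_2"])
print(parsed.dtypes)
```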

Outer join

The outer join on the Name column is straightforward:

outer = pd.merge(df1,df, on="Name",how="outer")
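
To see why this join can be memory-hungry, it may help to look at how many rows it produces. The toy frames `left` and `right` below are made up for illustration; each name contributes one row per pair of matching rows, and an outer join additionally keeps names found in only one table.

```python
import pandas as pd

# Toy stand-ins for df1 and df (hypothetical data)
left = pd.DataFrame({
    "Name": ["Alan", "Alan", "Bob"],
    "Date": pd.to_datetime(["2015-08-18", "2015-08-19", "2015-08-20"]),
})
right = pd.DataFrame({
    "Name": ["Alan", "Alan", "Alan", "Chris"],
    "Date_1": pd.to_datetime(["2013-06-21", "2013-06-22", "2013-06-23", "2010-11-15"]),
})

outer = pd.merge(left, right, on="Name", how="outer")
inner = pd.merge(left, right, on="Name", how="inner")

# Alan yields 2 x 3 = 6 pairs; the outer join also keeps Bob and Chris
print(len(outer), len(inner))
```

The unmatched rows carry missing dates, and comparisons against missing dates evaluate to False, so a date filter applied afterwards drops them. Under that filter, `how="inner"` should give the same counts with slightly fewer intermediate rows.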

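Merging onto df1 (the assumed requirement)

Simply pivot the data by Name, Date and Period first and count the rows in each group. Then call reset_index and merge with the original table; where no match is found, I assume 0.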

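# Keep only the joined rows that satisfy both date conditions
overlap = outer[(outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])]
# Pivot table: count the matching rows for each row of df1
pivot = overlap.pivot_table(index=["Name", "Date", "Period"],
                            values=["Date_1"],
                            aggfunc="count").reset_index()
# Rename columns for merging (a flat list; a nested list would create a MultiIndex)
pivot.columns = ["Name", "Date", "Period", "Volume"]

pd.merge(df1, pivot, how="left", on=["Name", "Date", "Period"]).fillna(0)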
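
One way to gain confidence in the vectorised version is to compare it with the row-wise `apply` from the question on a small input. The sketch below uses made-up data and swaps `pivot_table` for `groupby(...).size()`, which counts the same groups. It assumes each (Name, Date, Period) combination occurs only once in df1; duplicates would inflate the counts.

```python
import pandas as pd

# Small hypothetical inputs with the same columns as df and df1
df = pd.DataFrame({
    "Name": ["Alan", "Alan", "Bob"],
    "Date_1": pd.to_datetime(["2013-06-21", "2013-06-20", "2011-01-29"]),
    "Date_2": pd.to_datetime(["2013-06-26", "2013-06-30", "2011-02-01"]),
})
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2013-06-24", "2015-08-18", "2011-01-30"]),
    "Name": ["Alan", "Alan", "Bob"],
    "Period": pd.to_datetime(["2013-06-25", "2015-08-16", "2011-01-28"]),
})

# Row-wise version from the question (slow on large data)
slow = df1.apply(lambda x: len(df[(df["Name"] == x["Name"])
                                  & (df["Date_1"] < x["Period"])
                                  & (df["Date_2"] > x["Date"])]), axis=1)

# Join, filter, count per df1 row, then left-merge the counts back
outer = pd.merge(df1, df, on="Name", how="outer")
mask = (outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])
counts = outer[mask].groupby(["Name", "Date", "Period"]).size().reset_index(name="Volume")
fast = pd.merge(df1, counts, how="left", on=["Name", "Date", "Period"])
fast["Volume"] = fast["Volume"].fillna(0).astype(int)

print(fast["Volume"].tolist() == slow.tolist())
```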

Merging onto df (for completeness)

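# Keep only the joined rows that satisfy both date conditions
overlap = outer[(outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])]
# Pivot table: count the matching rows per Name
pivot = overlap.pivot_table(index=["Name"],
                            values=["Date"],
                            aggfunc="count").reset_index()
# Rename columns for merging (a flat list; a nested list would create a MultiIndex)
pivot.columns = ["Name", "Volume"]

pd.merge(df, pivot, how="left", on="Name").fillna(0)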
