IntervalIndex merge in Pandas with duplicate dates + unique IDs


First question here, still learning Python & pandas.

EDIT: I've managed to pivot df_values from long to wide so that there is a unique ID + date index (i.e., no more than one row per uniqueID per day). However, I still haven't gotten to the result I'm after.

I have two DFs that I'd like to merge based on a) the uniqueID and b) whether that uniqueID falls within several distinct date ranges. I found this question, which comes close to what I'm looking for; however, after that solution didn't pan out and after digging around a bit, it seems that what I'm attempting may not be possible due to dates overlap (?).

The gist of it: if a uniqueID is present in df_dates_range, and its corresponding day column falls within that row's start:end range, then sum all of its values from df_values.

Each of the DFs has more columns, but these are the relevant ones. Duplicates are implied throughout and come in no particular order. All of the DF series are appropriately formatted.

Here is df1, the date ranges:

import pandas as pd
import numpy as np

dates_range = {"uniqueID": [1, 2, 3, 4, 1, 7, 10, 11, 3, 4, 7, 10],
               "start": ["12/31/2019", "12/31/2019", "12/31/2019", "12/31/2019", "02/01/2020", "02/01/2020", "02/01/2020", "02/01/2020", "03/03/2020", "03/03/2020", "03/03/2020", "03/03/2020"],
               "end": ["01/04/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/05/2020", "02/05/2020", "02/05/2020", "02/05/2020", "03/08/2020", "03/08/2020", "03/08/2020", "03/08/2020"],
               "df1_tag1": ["v1", "v1", "v1", "v1", "v2", "v2", "v2", "v2", "v3", "v3", "v3", "v3"]}

df_dates_range = pd.DataFrame(dates_range, 
                              columns = ["uniqueID", 
                                         "start", 
                                         "end", 
                                         "df1_tag1"])

df_dates_range[["start","end"]] = df_dates_range[["start","end"]].apply(pd.to_datetime, infer_datetime_format = True)

And df2, the values:

values = {"uniqueID": [1, 2, 7, 3, 4, 4, 10, 1, 8, 7, 10, 9, 10, 8, 3, 10, 11, 3, 7, 4, 10, 14], 
          "df2_tag1": ["abc", "abc", "abc", "abc", "abc", "def", "abc", "abc", "abc", "abc", "abc", "abc", "def", "def", "abc", "abc", "abc", "def", "abc", "abc", "def", "abc"], 
          "df2_tag2": ["type 1", "type 1", "type 2", "type 2", "type 1", "type 2", "type 1", "type 2", "type 2", "type 1", "type 2", "type 1", "type 1", "type 2", "type 1", "type 1", "type 2", "type 1", "type 2", "type 1", "type 1", "type 1"], 
          "day": ["01/01/2020", "01/02/2020", "01/03/2020", "01/03/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/01/2020", "02/02/2020", "02/03/2020", "02/03/2020", "02/04/2020", "02/05/2020", "02/05/2020", "03/03/2020", "03/04/2020", "03/04/2020", "03/06/2020", "03/06/2020", "03/07/2020", "03/06/2020", "04/08/2020"],
          "df2_value1": [2, 10, 6, 5, 7, 9, 3, 10, 9, 7, 4, 9, 1, 8, 7, 5, 4, 4, 2, 8, 8, 4], 
          "df2_value2": [1, 5, 10, 13, 15, 10, 12, 50, 3, 10, 2, 1, 4, 6, 80, 45, 3, 30, 20, 7.5, 15, 3], 
          "df2_value3": [0.547, 2.160, 0.004, 9.202, 7.518, 1.076, 1.139, 25.375, 0.537, 7.996, 1.475, 0.319, 1.118, 2.927, 7.820, 19.755, 2.529, 2.680, 17.762, 0.814, 1.201, 2.712]}

values["day"] = pd.to_datetime(values["day"], format = "%m/%d/%Y")

df_values = pd.DataFrame(values, 
                         columns = ["uniqueID", 
                                    "df2_tag1", 
                                    "df2_tag2", 
                                    "day", 
                                    "df2_value1", 
                                    "df2_value2", 
                                    "df2_value1"])

Working from the first linked question, I tried running the following:

df_dates_range.index = pd.IntervalIndex.from_arrays(df_dates_range["start"], 
                                                        df_dates_range["end"], 
                                                        closed = "both")

df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))

df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])

However, I get the error below. n00b check done: I dropped the index entries from the second day through the last day and the problem persists:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-58-54ea384e06f7> in <module>
     14 df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
     15 
---> 16 df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])

C:\anaconda\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3846             else:
   3847                 values = self.astype(object).values
-> 3848                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3849 
   3850         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-58-54ea384e06f7> in <lambda>(x)
     14 df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
     15 
---> 16 df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_indexer_non_unique(self, target)
   4471     @Appender(_index_shared_docs["get_indexer_non_unique"] % _index_doc_kwargs)
   4472     def get_indexer_non_unique(self, target):
-> 4473         target = ensure_index(target)
   4474         pself, ptarget = self._maybe_promote(target)
   4475         if pself is not self or ptarget is not target:

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in ensure_index(index_like, copy)
   5355             index_like = copy(index_like)
   5356 
-> 5357     return Index(index_like)
   5358 
   5359 

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
    420             return Index(np.asarray(data), dtype=dtype, copy=copy, name=name, **kwargs)
    421         elif data is None or is_scalar(data):
--> 422             raise cls._scalar_data_error(data)
    423         else:
    424             if tupleize_cols and is_list_like(data):

TypeError: Index(...) must be called with a collection of some kind, Timestamp('2020-01-01 00:00:00') was passed
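(A side note on the immediate cause of this error: Series.apply hands each day value to the lambda as a single scalar Timestamp, while get_indexer_non_unique expects a collection it can turn into an Index, hence the TypeError. The minimal sketch below, using nothing but pandas, only demonstrates the scalar-vs-collection issue and does not by itself solve the overlapping-intervals problem.)

import pandas as pd

# a non-unique DatetimeIndex, similar to df_values_date_index.index
idx = pd.DatetimeIndex(["2020-01-01", "2020-01-01", "2020-01-02"])
ts = pd.Timestamp("2020-01-01")

# idx.get_indexer_non_unique(ts)                     # raises: Index(...) must be called with a collection
indexer, missing = idx.get_indexer_non_unique([ts])  # wrapping the scalar in a list works
print(indexer)                                       # -> [0 1], positions of the matching rows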

The expected result would be:

desired = {"uniqueID": [1, 2, 3, 4, 1, 7, 10, 11, 3, 4, 7, 10],
           "start": ["12/31/2019", "12/31/2019", "12/31/2019", "12/31/2019", "02/01/2020", "02/01/2020", "02/01/2020", "02/01/2020", "03/03/2020", "03/03/2020", "03/03/2020", "03/03/2020"],
           "end": ["01/04/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/05/2020", "02/05/2020", "02/05/2020", "02/05/2020", "03/08/2020", "03/08/2020", "03/08/2020", "03/08/2020"],
           "df1_tag1": ["v1", "v1", "v1", "v1", "v2", "v2", "v2", "v2", "v3", "v3", "v3", "v3"],
           "df2_tag1": ["abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc"],
           "df2_value1": [2, 10, 5, 16, 10, 7, 5, np.nan, 11, 8, 2, 8], 
           "df2_value2+df2_value3": [1.547, 7.160, 22.202, 33.595, 75.375, 17.996, 8.594,  np.nan, 120.501, 8.314, 37.762, 16.201], 
           "df2_tag3": ["abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc"]}


df_desired = pd.DataFrame(desired, 
                          columns = ["uniqueID", 
                                     "start", 
                                     "end", 
                                     "df1_tag1", 
                                     "df2_tag1", 
                                     "df2_value1", 
                                     "df2_value2+df2_value3", 
                                     "df2_tag3"])

df_desired[["start","end"]] = df_desired[["start","end"]].apply(pd.to_datetime, infer_datetime_format = True)

Or, visualized graphically:

[screenshot of the desired result laid out as a spreadsheet]

Note that S & T on row 10 are NaN because uniqueID 11 had no "activity" during v2; however, if possible, I'd like to be able to pull the tags from df2 anyway; they are 100% all in there, just perhaps not for that period, so maybe a task for a second script? Also note that col T is the aggregate of cols J+K.

EDIT: Forgot to mention that I had previously tried to do this with @firelynx's solution on this question, but my machine couldn't handle it despite having 32GB of RAM. The SQL solution didn't work for me because of some sqlite3 library issues.


Tags: df, date, index, type, range, start, v2, end
2 Answers

Finally cracked it!

Since IntervalIndex can only handle unique dates, what I did was map the unique start:end intervals, with their unique tag, onto df_values. The mistake I was making was using the whole df_dates_range as the IntervalIndex array values, so it was just a matter of extracting the uniques. One thing I'm not clear on is what happens when/if any interval range has more than one applicable df1_tag1 value; hopefully it just builds a list of tags and works either way.

Keep in mind that before doing the following I needed to pivot df_values from long to wide, so I used a group_by at the level that was producing the duplicated uniqueID + day rows (a rough sketch of that collapse follows below). For some reason I couldn't reproduce that step here with the sample data, but in any case the following should work as long as your data is in the required (wide/long) format, so that df_values no longer has duplicate uniqueID + day rows.
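As a rough illustration (the grouping level and the plain sum() aggregation are assumptions on my part, and df_values_wide / df_values_with_df1_tag1 are simply the names the later snippets rely on), the collapse could look something like this:

# hypothetical sketch: collapse df_values to one row per uniqueID + day,
# summing the value columns and keeping one tag per group
df_values_wide = (df_values
                  .groupby(["uniqueID", "day"], as_index = False)
                  .agg({"df2_tag1": "first",
                        "df2_value1": "sum",
                        "df2_value2": "sum",
                        "df2_value3": "sum"}))

# work on a copy; this is the frame the applicable df1_tag1 gets mapped onto below
df_values_with_df1_tag1 = df_values_wide.copy()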

After that, I did the following:

# In order to bypass the non "uniqueness" of the desired tag to apply, we create a list of the unique df1_tag1's with their respective start:end dates

df1_tag1_list = df_dates_range.groupby(["start", 
                                        "end", 
                                        "df1_tag1"]).size().reset_index().rename(columns={0:'records'})

Then:

# Create a pandas Series keyed by an IntervalIndex, to then map ("paste onto") the copied df_values
applicable_df1_tag1 = pd.Series(df1_tag1_list["df1_tag1"].values, 
                                pd.IntervalIndex.from_arrays(df1_tag1_list["start"], 
                                                             df1_tag1_list["end"], 
                                                             closed = "both"))  # include both endpoints so days equal to start or end still match

# map the df1_tag1 onto the applicable rows of the copied df_values
df_values_with_df1_tag1["applicable_df1_tag1"] = df_values_with_df1_tag1["day"].map(applicable_df1_tag1)

The result should be the aggregated df_values (sums, or whatever other math function was used during the groupby), with non-duplicated uniqueID + day rows that now carry the mapped df1_tag1, which we can then merge onto df_dates_range using the uniqueID.

Hope this is an answer that works for some of you :)

EDIT: probably also worth mentioning that when I did the left merge, I used the following to avoid unwanted duplicates:

df_dates_range_all = df_dates_range.merge(df_values_with_df1_tag1.drop_duplicates(subset = ["uniqueID"],
                                                                                  keep = "last"),
                                          how = "left",
                                          left_on = ["uniqueID", "df1_tag1"],
                                          right_on = ["uniqueID", "applicable_df1_tag1"],
                                          indicator = True)
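An alternative to the drop_duplicates trick above (my own sketch, assuming a plain sum is the aggregation wanted) is to aggregate the mapped day-level rows per uniqueID + applicable_df1_tag1 first and then merge, which leaves exactly one row per uniqueID per interval:

# sum the day-level values per uniqueID and applicable tag, then left-merge onto the date ranges
agg = (df_values_with_df1_tag1
       .groupby(["uniqueID", "applicable_df1_tag1"], as_index = False)
       [["df2_value1", "df2_value2", "df2_value3"]]
       .sum())

result = df_dates_range.merge(agg,
                              how = "left",
                              left_on = ["uniqueID", "df1_tag1"],
                              right_on = ["uniqueID", "applicable_df1_tag1"])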

In cases like this, the simplest approach (if you can afford it hardware-wise) is to create a temporary DataFrame and aggregate afterwards. This has the huge advantage of separating the merge from the aggregation, which drastically reduces the complexity.

In [22]: df = pd.merge(df_dates_range, df_values)                                                                                                                                                                
Out[22]: 
    uniqueID      start        end        day  value1  medium
0          1 2019-12-31 2020-01-04 2020-01-01       1  Online
1          1 2019-12-31 2020-01-04 2020-02-01      50  Online
2          1 2020-02-01 2020-02-05 2020-01-01       1  Online
3          1 2020-02-01 2020-02-05 2020-02-01      50  Online
4          2 2019-12-31 2020-01-04 2020-01-02       5    Shop
..       ...        ...        ...        ...     ...     ...
23        10 2020-02-01 2020-02-05 2020-03-04      45    Shop
24        10 2020-03-03 2020-03-08 2020-01-03      13    Shop
25        10 2020-03-03 2020-03-08 2020-02-03       2  Online
26        10 2020-03-03 2020-03-08 2020-03-04      45    Shop
27        11 2020-02-01 2020-02-05 2020-02-05       4    Shop

In [24]: df = df[(df['day'] > df['start']) & (df['day'] <= df['end'])]                                                                                                                                           
Out[24]: 
    uniqueID      start        end        day  value1  medium
0          1 2019-12-31 2020-01-04 2020-01-01       1  Online
4          2 2019-12-31 2020-01-04 2020-01-02       5    Shop
5          3 2019-12-31 2020-01-04 2020-01-04      12    Shop
10         3 2020-03-03 2020-03-08 2020-03-06      30  Online
11         4 2019-12-31 2020-01-04 2020-01-04      15  Online
12         4 2019-12-31 2020-01-04 2020-01-04      10    Shop
16         7 2020-02-01 2020-02-05 2020-02-03      10    Shop
20         7 2020-03-03 2020-03-08 2020-03-06      20    Shop
22        10 2020-02-01 2020-02-05 2020-02-03       2  Online
26        10 2020-03-03 2020-03-08 2020-03-04      45    Shop
27        11 2020-02-01 2020-02-05 2020-02-05       4    Shop

Then you can do something like

In [30]: df.groupby(['start', 'end', 'uniqueID', 'medium'])['value1'].agg(['count', 'sum']).reset_index()                                                                                                   
Out[30]: 
        start        end  uniqueID  medium  count  sum
0  2019-12-31 2020-01-04         1  Online      1    1
1  2019-12-31 2020-01-04         2    Shop      1    5
2  2019-12-31 2020-01-04         3    Shop      1   12
3  2019-12-31 2020-01-04         4  Online      1   15
4  2019-12-31 2020-01-04         4    Shop      1   10
5  2020-02-01 2020-02-05         7    Shop      1   10
6  2020-02-01 2020-02-05        10  Online      1    2
7  2020-02-01 2020-02-05        11    Shop      1    4
8  2020-03-03 2020-03-08         3  Online      1   30
9  2020-03-03 2020-03-08         7    Shop      1   20
10 2020-03-03 2020-03-08        10    Shop      1   45

to aggregate the data into the desired shape. However, I don't get the result you expect: among the values there are rows with Shop, and some of the dates are slightly earlier. I blame the initial values ;) Hopefully this pushes you in the right direction.
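For what it's worth, a minimal adaptation of the same merge-then-filter-then-aggregate idea to the column names from the question could look like this (the inclusive bounds and the choice of value columns to sum are my assumptions):

# merge on uniqueID, keep only the days that fall inside each interval
# (inclusive on both ends, matching the closed = "both" intervals), then sum
df = pd.merge(df_dates_range, df_values, on = "uniqueID")
df = df[(df["day"] >= df["start"]) & (df["day"] <= df["end"])]

out = (df.groupby(["uniqueID", "start", "end", "df1_tag1"], as_index = False)
         [["df2_value1", "df2_value2", "df2_value3"]]
         .sum())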

Note: if you are only interested in the first or last value within an interval, pd.merge_asof is an interesting option:

In [17]: pd.merge_asof(df_dates_range, df_values, left_on='start', right_on='day', by='uniqueID', direction='forward')                                                                                      
Out[17]: 
    uniqueID      start        end        day  value1  medium
0          1 2019-12-31 2020-01-04 2020-01-01     1.0  Online
1          2 2019-12-31 2020-01-04 2020-01-02     5.0    Shop
2          3 2019-12-31 2020-01-04 2020-01-04    12.0    Shop
3          4 2019-12-31 2020-01-04 2020-01-04    15.0  Online
4          1 2020-02-01 2020-02-05 2020-02-01    50.0  Online
5          7 2020-02-01 2020-02-05 2020-02-03    10.0    Shop
6         10 2020-02-01 2020-02-05 2020-02-03     2.0  Online
7         11 2020-02-01 2020-02-05 2020-02-05     4.0    Shop
8          3 2020-03-03 2020-03-08 2020-03-03    80.0  Online
9          4 2020-03-03 2020-03-08        NaT     NaN     NaN
10         7 2020-03-03 2020-03-08 2020-03-06    20.0    Shop
11        10 2020-03-03 2020-03-08 2020-03-04    45.0    Shop

It is, however, practically impossible to squeeze the aggregation into this approach.
