Identifying duplicate time series in Postgres

2 votes

2 answers

916 views

Asked 2025-04-27 13:28

I have a time series table (in a Postgres database) with the following columns:

item_id,  country_id,  year,  month, value

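The table contains some duplicated time series: they have the same country ID and the same dates/values, but were assigned different item IDs, for example 'Red Apples' and 'Apples, Red'.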

How can I find these duplicated time series? I want (country_id, year, month, value) to match across all dates for the item.

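I am still a beginner, so please bear with me if I have left out any details. I am mainly interested in the conceptual approach; I can implement it in either Postgres or python/Pandas.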

As an example, I would like to be able to detect a case like this:

item_id,     country_id,     year,     month,    value
-------------------------------------------------------
Red Apples   5               1996      1         300
Red Apples   5               1996      2         500
Red Apples   5               1996      3         370
Apples, Red  5               1996      1         300
Apples, Red  5               1996      2         500
Apples, Red  5               1996      3         370

I would like the output to look something like this:

item_id1,     item_id2,      country_id,     year,     month_range
-----------------------------------------------------------------
Red Apples    Apples, Red         5          1996       [1,3]

A result like this would also be fine:

item_id1,     item_id2,      country_id,     year,     time_month,   value
--------------------------------------------------------------------------
Red Apples    Apples, Red         5          1996         1           300
Red Apples    Apples, Red         5          1996         2           500
Red Apples    Apples, Red         5          1996         3           370

I thought about trying something like this:

select distinct A.country_id, A.item_id, B.item_id, A.year, A.month, A.value
                      from my_table as A,
                      my_table as B 
                      where
                      (A.country_id=B.country_id and 
                      A.item_id<>B.item_id and 
                      A.year=B.year and 
                      A.month=B.month and 
                      A.value=B.value )

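I would then check that every date/value appears for each pair of item IDs. If possible, though, I would like to check all the dates/values in one go.

I am not sure whether a table join is the right approach here...?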


2 Answers

This query selects all the records from my_table,
groups them by country_id, year, month and value,
and finally keeps only the groups that contain more than one item_id.

Note: this query has not been tested!
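Since the question mentions that a Python implementation would also be acceptable, the grouping described above can be sketched in plain Python. This is only a minimal illustration, assuming the rows fit in memory as tuples; the function name shared_points is hypothetical and the sample rows are the ones from the question.

```python
from collections import defaultdict

# Sample rows from the question: (item_id, country_id, year, month, value)
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
]


def shared_points(rows):
    """Group rows by (country_id, year, month, value) and keep only
    the groups that contain more than one distinct item_id."""
    groups = defaultdict(set)
    for item_id, country_id, year, month, value in rows:
        groups[(country_id, year, month, value)].add(item_id)
    return {key: sorted(items) for key, items in groups.items() if len(items) > 1}


duplicates = shared_points(rows)
for key in sorted(duplicates):
    print(key, duplicates[key])
```

Note that this only flags individual data points shared by several items; it does not yet confirm that two items match over their whole series.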

Answered 2025-04-27 by Python大师


Please see the update below!

It would help if you could provide more details about the sample data and the expected results. I think the following query might help you:

SELECT country_id,  year,  month, value
  FROM a_table
 GROUP BY country_id,  year,  month, value
HAVING count(*) > 1;

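This query shows all entries that are equal in everything except item_id. If you want to find all the rows that belong to the duplicate groups, you can use this query: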

SELECT item_id, country_id,  year,  month, value
  FROM a_table
 WHERE (country_id,  year,  month, value)
    IN (
    SELECT country_id,  year,  month, value
      FROM a_table
     GROUP BY country_id,  year,  month, value
    HAVING count(*) > 1)
 ORDER BY country_id,  year,  month, value, item_id;

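I put item_id last in the ORDER BY so that duplicates are easier to spot. You can adjust this as needed. The query may take a while, depending on how much data you have.

To avoid similar situations (duplicate data) in the future, you may want to create a unique constraint, like this: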

ALTER TABLE a_table ADD CONSTRAINT u_cymv
    UNIQUE (country_id,  year,  month, value);

EDIT: After the comments were added, I came up with the following query to find duplicate series:

WITH a_table(item_id,country_id,year,month,value) AS (VALUES
    ('Red Apples'::text,5,1996,1,300::numeric),
    ('Red Apples',5,1996,2,500),
    ('Red Apples',5,1996,3,370),
    ('Apples, Red',5,1996,1,300),
    ('Apples, Red',5,1996,2,500),
    ('Apples, Red',5,1996,3,370)
), dups AS (
    SELECT string_agg(item_id,'/') AS items,
           country_id,value,
           daterange(to_date(year::text||month,'YYYYMM'),
                     (to_date(year::text||month,'YYYYMM')
                      +INTERVAL'1mon')::date,'[)') AS range
      FROM a_table
     GROUP BY country_id,year,month,value
    HAVING count(*) > 1
)
SELECT grp,count(*),items,country_id,
       daterange(min(lower(range)), max(upper(range)), '[)') r,
       array_agg(value)
  FROM ( 
    SELECT items,country_id,range,value,
           sum(g) OVER (ORDER BY country_id, range) grp
      FROM (
        SELECT items,country_id,
               range,value,
               CASE WHEN lag(range) OVER (PARTITION BY country_id
                                          ORDER BY range) -|- range
                    THEN NULL ELSE 1 END g
          FROM dups) s
    ) s
 GROUP BY grp,country_id,items
HAVING count(*) >= 3
 ORDER BY country_id,r,items;

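What this query does:

a_table is a copy of the sample data provided;
dups finds the duplicated records. I also convert the year and month columns into a daterange, because I could not find another suitable way to correctly detect series that span a new year;
after marking the duplicates, I compare the previous range (within country_id) with the current one, and set the group flag g if they are not adjacent;
next, I use the running-sum effect of the sum() window function to create the group identifier grp. For the sample data this yields a single group;
finally, I GROUP BY grp to collect the data into series. I also include country_id and items in the GROUP BY, but only to avoid wrapping them in aggregate functions; they are unique within each grp. I also build a new daterange column, because range types have no built-in aggregate function.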
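The flag-and-running-sum steps above can be sketched in plain Python to make the idea easier to follow. This is a rough illustration rather than a translation of the SQL: the input list dup_months and the helper names are hypothetical, and a running month number stands in for the daterange adjacency test so that December and January count as neighbours.

```python
from itertools import accumulate

# Hypothetical input: (year, month) pairs on which two items share a value
# within one country, i.e. what the dups step would produce.
dup_months = [(1996, 1), (1996, 2), (1996, 3), (1996, 11), (1996, 12), (1997, 1)]


def month_index(year, month):
    """Map (year, month) to a running month number so that
    December and the following January are adjacent."""
    return year * 12 + (month - 1)


def label_runs(months):
    """Return the sorted months and a group id per month. A new group
    starts wherever the previous month is not adjacent."""
    ordered = sorted(months)
    idx = [month_index(y, m) for y, m in ordered]
    # Flag g: 1 for the first row or after a gap, 0 when adjacent to the previous row.
    flags = [1 if i == 0 or idx[i] - idx[i - 1] != 1 else 0 for i in range(len(idx))]
    # The running sum of the flags plays the role of the group identifier grp.
    return ordered, list(accumulate(flags))


def runs(months, min_length=3):
    """Collect months by group id and keep runs of at least min_length months."""
    ordered, grp = label_runs(months)
    grouped = {}
    for ym, g in zip(ordered, grp):
        grouped.setdefault(g, []).append(ym)
    return [(r[0], r[-1]) for r in grouped.values() if len(r) >= min_length]


result = runs(dup_months)
print(result)
```

The min_length parameter mirrors the HAVING count(*) >= 3 filter in the query; the sketch assumes each month appears at most once in the input.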

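Before running this query you may need to increase work_mem, possibly to 1GB, depending on the number of rows in your actual table. Please give it a try and let me know whether it works. If you can, it would also be helpful to share the EXPLAIN (analyze, buffers) output for this query.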
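If the goal is strictly to find items whose entire series are identical within a country, a different and simpler technique than the range-based query is to compare a fingerprint of each series. The sketch below is not the query above; it is a plain-Python illustration under the assumption that the data fits in memory, with hypothetical names (duplicate_series, the extra 'Green Pears' rows) added for the example.

```python
from collections import defaultdict

# (item_id, country_id, year, month, value); the first six rows come from the
# question, the 'Green Pears' rows are made up to show a non-matching series.
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
    ("Green Pears", 5, 1996, 1, 300),
    ("Green Pears", 5, 1996, 2, 510),
]


def duplicate_series(rows):
    """Return (country_id, item_ids, points) for every set of items whose
    complete series of (year, month, value) points are identical."""
    # Collect each series as a set of points per (country_id, item_id).
    series = defaultdict(set)
    for item_id, country_id, year, month, value in rows:
        series[(country_id, item_id)].add((year, month, value))
    # Items with the same frozen point set in the same country share a key.
    by_fingerprint = defaultdict(list)
    for (country_id, item_id), points in series.items():
        by_fingerprint[(country_id, frozenset(points))].append(item_id)
    return [
        (country_id, sorted(items), sorted(points))
        for (country_id, points), items in by_fingerprint.items()
        if len(items) > 1
    ]


matches = duplicate_series(rows)
print(matches)
```

This requires an exact match over the whole series, so it is stricter than the query above, which also reports partial runs of three or more matching months.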

Answered 2025-04-27 by Python大师

