Pandas: expanding a dataframe row-wise, similar to R's survSplit()

person_id  tstart  tend
1          0.00    1.00
2          0.00    1.00
2          1.00    2.00
2          2.00    2.34
3          0.00    1.00
3          1.00    2.00
3          2.00    6.85

1 answer

User

#1 · Posted 2024-04-26 14:43:51

Consider the method defined below. It is somewhat verbose, but it uses no loops, unlike survSplit, whose actual source code is written in C.

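The function essentially cross-joins the persons frame against a grid of one-year tenure intervals running up to the chunk argument, keeping only the intervals that fit within each person's years. It then concatenates that merge result with the original dataframe rows, which carry calculated tstart and tend values for each person's final interval. The cross join requires a constant key column on the original dataframe, here persons: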

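from io import StringIO
import pandas as pd
import numpy as np

# Sample data; the constant `key` column is what makes the cross join possible.
persons = pd.read_table(StringIO("""person_id  years
1          1.00
2          2.34
3          6.85"""), sep=r"\s+").assign(key=1)

def expand_tenure(chunk):
    # Final interval per person: it starts at `chunk`, or at the last whole
    # year when the tenure is shorter than `chunk`. float() keeps tstart float.
    newpersons = persons.assign(tstart=float(chunk), tend=persons['years'])
    newpersons.loc[newpersons['tend'] < chunk, 'tstart'] = np.floor(persons['years'])

    # Grid of one-year intervals: (0, 1), (1, 2), ..., (chunk-1, chunk).
    df = pd.DataFrame({'tstart': list(range(0, chunk)),
                       'tend': list(range(1, chunk+1)),
                       'key': 1})

    # Cross join on the constant key, then keep intervals within each tenure.
    mdf = pd.merge(persons, df, on='key')
    mdf = mdf[mdf['tend'] <= mdf['years']][['person_id', 'tstart', 'tend']]

    # Stack both parts. Sorting by tstart first lets drop_duplicates discard
    # the zero-length row produced when a tenure is a whole number of years.
    cdf = (pd.concat([newpersons[['person_id', 'tstart', 'tend']], mdf])
             .sort_values(['person_id', 'tstart'])
             .drop_duplicates(['person_id', 'tend'])
             .reset_index(drop=True))

    return cdf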
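
The cross join is the core of the function above, so it may help to see that step on its own. This is only a minimal sketch, assuming pandas 1.2 or later, where merge accepts how="cross" and the dummy key column is not needed. The names people, grid, crossed and full_years are made up for this illustration, and chunk is fixed at 4.

```python
import pandas as pd

# Same sample data as above, without the helper key column.
people = pd.DataFrame({"person_id": [1, 2, 3],
                       "years": [1.00, 2.34, 6.85]})

chunk = 4  # number of one-year intervals to generate (assumed for this sketch)
grid = pd.DataFrame({"tstart": list(range(0, chunk)),
                     "tend": list(range(1, chunk + 1))})

# how="cross" pairs every person with every interval: 3 x 4 = 12 rows.
crossed = people.merge(grid, how="cross")

# Keep only the intervals that end within each person's tenure.
full_years = crossed[crossed["tend"] <= crossed["years"]]
print(full_years[["person_id", "tstart", "tend"]])
```

The filtered frame holds only the complete one-year intervals. The partial interval at the end of each tenure still has to be appended separately, which is what the concat step in expand_tenure does.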

Output of expand_tenure (three runs)

print(expand_tenure(1))
#    person_id  tstart  tend
# 0          1     0.0  1.00
# 1          2     0.0  1.00
# 2          2     1.0  2.34
# 3          3     0.0  1.00
# 4          3     1.0  6.85

print(expand_tenure(4))
#    person_id  tstart  tend
# 0          1     0.0  1.00
# 1          2     0.0  1.00
# 2          2     1.0  2.00
# 3          2     2.0  2.34
# 4          3     0.0  1.00
# 5          3     1.0  2.00
# 6          3     2.0  3.00
# 7          3     3.0  4.00
# 8          3     4.0  6.85

print(expand_tenure(12))
#     person_id  tstart  tend
# 0           1     0.0  1.00
# 1           2     0.0  1.00
# 2           2     1.0  2.00
# 3           2     2.0  2.34
# 4           3     0.0  1.00
# 5           3     1.0  2.00
# 6           3     2.0  3.00
# 7           3     3.0  4.00
# 8           3     4.0  5.00
# 9           3     5.0  6.00
# 10          3     6.0  6.85
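
As the three runs show, chunk caps the number of one-year intervals, and any tenure left over is folded into the last row (4.0 to 6.85 when chunk is 4). If every person should be split completely, chunk can be derived from the data rather than guessed. The following is a hedged sketch, not the method above: split_tenure is a hypothetical variant that takes the frame as an argument, defaults chunk to the longest tenure rounded up, and assumes pandas 1.2 or later for how="cross".

```python
import numpy as np
import pandas as pd

def split_tenure(people, chunk=None):
    """Hypothetical variant of expand_tenure with a data-driven default chunk."""
    if chunk is None:
        # Enough one-year intervals to cover the longest tenure.
        chunk = int(np.ceil(people["years"].max()))

    # Final interval per person: starts at the last whole year, capped at chunk.
    last = people.assign(
        tstart=np.minimum(np.floor(people["years"]), chunk),
        tend=people["years"],
    )

    # Grid of one-year intervals, cross-joined to every person.
    grid = pd.DataFrame({"tstart": np.arange(chunk, dtype=float),
                         "tend": np.arange(1, chunk + 1, dtype=float)})
    full = people.merge(grid, how="cross")
    full = full[full["tend"] <= full["years"]]

    cols = ["person_id", "tstart", "tend"]
    # Sorting by tstart before de-duplicating drops zero-length final rows.
    return (pd.concat([full[cols], last[cols]])
              .sort_values(["person_id", "tstart"])
              .drop_duplicates(["person_id", "tend"])
              .reset_index(drop=True))

people = pd.DataFrame({"person_id": [1, 2, 3],
                       "years": [1.00, 2.34, 6.85]})
print(split_tenure(people))           # chunk defaults to 7 for this data
print(split_tenure(people, chunk=4))  # capped, like expand_tenure(4)
```

Passing the frame in rather than reading a global makes the function easier to reuse, and the default chunk avoids the silent truncation seen in the first two runs.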
