如何在没有拆分器的情况下拆分等长字符串并展开数据帧

2024-04-16 12:35:07 发布

您现在位置:Python中文网/ 问答频道 /正文

我想在不使用拆分器的情况下拆分等长字符串并展开数据帧。你知道吗

下面是我正在使用的测试数据帧:

sample1 = pd.DataFrame({
        'TST': {1: 1535840000000, 2: 1535840000000}, 
        'RCV': {1: 1535840000000, 2: 1535850000000}, 
        'TCU': {1: 358272000000000, 2: 358272000000000}, 
        'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'}
        })

如您所见,SPD列包含各种长度的字符串,没有任何拆分器。你知道吗

我想将每4个字符的SPD列拆分为新行,然后将它们扩展到dataframe。你知道吗

             TST            RCV              TCU   SPD
0  1535840000000  1535840000000  358272000000000  0000
1  1535840000000  1535840000000  358272000000000  0000
2  1535840000000  1535840000000  358272000000000  0000
3  1535840000000  1535840000000  358272000000000  0071
4  1535840000000  1535840000000  358272000000000  0000
5  1535840000000  1535840000000  358272000000000  007D
6  1535840000000  1535840000000  358272000000000  007C
7  1535840000000  1535840000000  358272000000000  00E2

我试着先用这个生成一个序列:

pd.concat([pd.Series(re.findall('....', row['SPD'])) for _, row in sample1.iterrows()]).reset_index()

这给了

   index     0
0      0  0000
1      1  0000
2      2  0000
3      3  0071
4      4  0000
5      5  007D
6      6  007C
7      7  00E2

但我无法将它扩展回sample1


Tags: 数据字符串dataframeindex情况rowpd测试数据
6条回答

您可以使用str.findall,然后使用repeat基于SPD中的4个字符片的数量的行。你知道吗

from itertools import chain

spd4 = df.pop('SPD').str.findall(r'.{4}') 

(pd.DataFrame(df.values.repeat(spd4.str.len(), axis=0), columns=df.columns)
   .assign(SPD=list(chain.from_iterable(spd4))))

             TST            RCV              TCU   SPD
0  1535840000000  1535850000000  358272000000000  0000
1  1535840000000  1535850000000  358272000000000  0000
2  1535840000000  1535850000000  358272000000000  0000
3  1535840000000  1535850000000  358272000000000  0071
4  1535840000000  1535850000000  358272000000000  0000
5  1535840000000  1535850000000  358272000000000  007D
6  1535840000000  1535850000000  358272000000000  007C
7  1535840000000  1535850000000  358272000000000  00E2

使用Series.str.extractall,然后与原始df连接。你知道吗

sample1.filter(regex='^(?!SPD)').join(
    sample1.SPD.str.extractall('(?P<SPD>.{4})').reset_index(level=1, drop=True)
) 

#             TST            RCV              TCU   SPD
#1  1535840000000  1535840000000  358272000000000   NaN
#2  1535840000000  1535850000000  358272000000000  0000
#2  1535840000000  1535850000000  358272000000000  0000
#2  1535840000000  1535850000000  358272000000000  0000
#2  1535840000000  1535850000000  358272000000000  0071
#2  1535840000000  1535850000000  358272000000000  0000
#2  1535840000000  1535850000000  358272000000000  007D
#2  1535840000000  1535850000000  358272000000000  007C
#2  1535840000000  1535850000000  358272000000000  00E2

使用内部联接(。。。how='inner')如果要排除少于4个字符的行SPD。你知道吗

您可以使用^{}SPD4个字符拆分字符串,然后使用^{}从链接的解决方案中取消结果数据帧:

sample1['SPD'] = sample1.SPD.str.ljust(4, '0').str.findall(r'.{4}?')
unnesting(sample1, ['SPD'])

   SPD            TST            RCV              TCU
1  0000  1535840000000  1535840000000  358272000000000
2  0000  1535840000000  1535850000000  358272000000000
2  0000  1535840000000  1535850000000  358272000000000
2  0000  1535840000000  1535850000000  358272000000000
2  0071  1535840000000  1535850000000  358272000000000
2  0000  1535840000000  1535850000000  358272000000000
2  007D  1535840000000  1535850000000  358272000000000
2  007C  1535840000000  1535850000000  358272000000000
2  00E2  1535840000000  1535850000000  358272000000000

相关问题 更多 >