Splitting a cuDF string Series into chunks

0 votes

1 answer

37 views

Asked 2025-04-14 17:54

I have a cuDF Series of very long strings, and I want to split each string into chunks of equal size.

My code looks roughly like this:

import cudf

s = cudf.Series(["abcdefg", "hijklmnop"])

def chunker(string):
    chunk_size = 3
    return [string[i:i+chunk_size] for i in range(0, len(string), chunk_size)]

print(s.apply(chunker))

But this code fails with the following error:

No implementation of function Function(<class 'range'>) found for signature:
 
 >>> range(Literal[int](0), Masked(int32), Literal[int](3))

If I replace len(string) with a fixed number, the code fails with a different error, this time about indexing:

No implementation of function Function(<built-in function getitem>) found for signature:
 
 >>> getitem(Masked(string_view), slice<a:b>)

The same code runs fine in plain pandas, but I want to run it on some very large datasets and take advantage of cuDF's GPU operations for speed.

Tags: data processing, string operations, indexing error, big data, cudf, gpu computing

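1 Answer

You can do this with the str.findall method and a regular expression that matches any character between 1 and 3 times (3 being the chunk size). This is faster in both pandas and cuDF: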

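import pandas as pd
import cudf

# Build a 2,000,000-row test Series on the CPU, then copy it to the GPU
N = 1000000
s = pd.Series(["abcdefg", "hijklmnop"]*N)
gs = cudf.from_pandas(s)

# %time is an IPython/Jupyter magic, so run this in a notebook or IPython shell
%time out = s.str.findall(".{1,3}")   # pandas (CPU)
%time out = gs.str.findall(".{1,3}")  # cuDF (GPU)
out.head()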
CPU times: user 3.55 s, sys: 164 ms, total: 3.72 s
Wall time: 3.7 s
CPU times: user 118 ms, sys: 31.8 ms, total: 150 ms
Wall time: 150 ms

0      [abc, def, g]
1    [hij, klm, nop]
2      [abc, def, g]
3    [hij, klm, nop]
4      [abc, def, g]
dtype: list
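
If the chunk size needs to vary, the pattern can be built from a parameter. Below is a minimal sketch of that idea. The helper name chunk_strings is hypothetical (it is not part of pandas or cuDF), and the example uses pandas so it runs without a GPU. The same str.findall call is what the answer above applies to a cuDF Series. One caveat: by default "." does not match newline characters, so strings containing newlines would not be chunked cleanly by this pattern.

```python
import pandas as pd


def chunk_strings(series, chunk_size):
    """Split each string into consecutive pieces of at most chunk_size characters.

    Hypothetical helper for illustration; relies only on Series.str.findall.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    # ".{1,n}" greedily matches up to n characters, so the last piece may be shorter
    pattern = ".{1,%d}" % chunk_size
    return series.str.findall(pattern)


s = pd.Series(["abcdefg", "hijklmnop"])
chunks = chunk_strings(s, 3)
print(chunks.tolist())
# [['abc', 'def', 'g'], ['hij', 'klm', 'nop']]
```

Because the work is done by a single vectorized string method rather than a Python function passed to apply, nothing has to be compiled per row, which is where the original chunker function failed.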

You may also be interested in cudf.pandas, a tool that can accelerate your existing pandas code with zero code changes.
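
As a rough sketch of how that might look in a script: cudf.pandas is enabled before pandas is imported, and the rest of the code stays plain pandas. The try/except fallback is my own addition so the snippet still runs on a machine without cuDF installed; it is not something cudf.pandas requires.

```python
# Enable the cudf.pandas accelerator if cuDF is available.
# This must happen before pandas is imported.
try:
    import cudf.pandas
    cudf.pandas.install()
except ImportError:
    # Assumption for illustration: fall back silently to CPU pandas
    pass

import pandas as pd

s = pd.Series(["abcdefg", "hijklmnop"])
# Unchanged pandas code; with cudf.pandas active, supported operations run on the GPU
out = s.str.findall(".{1,3}")
print(out)
```

In a Jupyter notebook the equivalent is to run %load_ext cudf.pandas in the first cell, and a whole script can be run with python -m cudf.pandas script.py instead of editing it.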

Answered 2025-04-14 by Python大师
