Applying a function to a DataFrame column

Posted 2024-05-16 14:53:35


I have a DataFrame:

  name    sample
1  a      Category 1: qwe, asd (line break) Category 2: sdf, erg
2  b      Category 2: sdf, erg(line break) Category 5: zxc, eru
...
30  p      Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err 

In the end I would like to get:

    name  qwe  asd  sdf  erg  zxc  eru  2134  EFDgh  Pdr tke  err
1   a       1    1    1    1    0    0     0      0        0    0
2   b       0    0    1    1    1    1     0      0        0    0
...
30  p       0    1    0    0    0    0     1      1        1    1

I wrote the following function:

def cleanattributes(istring):

    istring=str(istring)
    istring=istring.rstrip().split('\\n')

    counter=0
    for line in istring:
        istring[counter]=istring[counter].rpartition(': ')[-1]
        counter+=1
    istring=str(istring)
    istring = istring.replace("'", "")
    istring = istring.replace("\"", "")
    return(str(istring))

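This function produces the expected result: it returns the category contents without the category titles (the idea is to then use get_dummies to obtain the columns).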

teststring="Category 1: qwe, asd\\nCategory 2: sdf, erg"
cleanattributes(teststring)
OUTPUT: '[qwe, asd, sdf, erg]'

I'm not sure how best to apply this function to every record so that the DataFrame looks like this:

  name    sample
1  a      qwe, asd, sdf, erg
2  b      sdf, erg, zxc, eru
...
30  p      asd, 2134, EFDgh, Pdr tke, err 

I'm also not sure whether this is the best approach at all.

As requested:

df['sample'].iat[0]
OUTPUT: 'Category 1: qwe, asd\nCategory 2: sdf, erg'
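
For reference, applying a function to every record of one column is usually done with `Series.apply`. The following is only a minimal sketch, assuming the cells contain real newline characters, as the `iat[0]` output suggests. `clean_attributes` is a hypothetical rewrite of the function above, not the original, and the two-row frame is made-up sample data in the same shape as the question.

```python
import re
import pandas as pd

def clean_attributes(text):
    """Drop every 'Category <label>: ' title and return the values as one string."""
    # Join the lines with ', ' so values from different lines stay separated.
    flat = ", ".join(part.strip() for part in str(text).splitlines())
    # Non-greedy match: each title is removed on its own, even when two
    # titles appear on the same line.
    return re.sub(r"Category .*?: ", "", flat)

# Hypothetical sample data shaped like the question's frame.
df = pd.DataFrame({
    "name": ["a", "b"],
    "sample": ["Category 1: qwe, asd\nCategory 2: sdf, erg",
               "Category 2: sdf, erg\nCategory 5: zxc, eru"],
})

# Apply the function to each cell of the column.
df["sample"] = df["sample"].apply(clean_attributes)
print(df)
```

Unlike the original function, this sketch splits on actual line breaks rather than on the two-character sequence backslash + n, and it returns the values without the surrounding brackets.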

1 answer
Answer 1 · posted 2024-05-16 14:53:35
import pandas as pd

df = pd.DataFrame(
    {'name': ['a', 'b'],
     'sample': ['Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err',
                'Category 2: sdf, erg\nCategory 5: zxc, eru\nCategory 1: asd, Category PE: 2134, EFDgh, Pdr tke, err']})

df2 = pd.concat([df.name,
                 df['sample']
                 .str.replace("(Category .*: )+", '', regex=True)  # Remove "Category [*]:"
                 .str.replace(r'\n', '', regex=True)  # Remove "\n"
                 .str.split(', ', expand=True)],
                axis=1)

df3 = pd.melt(df2, id_vars='name')[['name', 'value']]

>>> pd.concat([df3['name'], pd.get_dummies(df3['value'])], axis=1)
   name  2134  EFDgh  Pdr tke  ergzxc  err  eru2134  sdf
0     a     1      0        0       0    0        0    0
1     b     0      0        0       0    0        0    1
2     a     0      1        0       0    0        0    0
3     b     0      0        0       1    0        0    0
4     a     0      0        1       0    0        0    0
5     b     0      0        0       0    0        1    0
6     a     0      0        0       0    1        0    0
7     b     0      1        0       0    0        0    0
8     a     0      0        0       0    0        0    0
9     b     0      0        1       0    0        0    0
10    a     0      0        0       0    0        0    0
11    b     0      0        0       0    1        0    0
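
Note that the output above does not yet match the table requested in the question. The merged columns `ergzxc` and `eru2134` appear because the line break is removed without inserting a separator. There is no `asd` column because the greedy `.*` swallows everything up to the last `Category ...: ` on a line. The result also has one row per value rather than one row per name. A possible alternative, offered only as a sketch, is to clean the strings and then call `Series.str.get_dummies` with `sep=', '`. The three-row frame below is made-up sample data based on the rows shown in the question, and the names `cleaned`, `dummies` and `result` are illustrative.

```python
import pandas as pd

# Hypothetical sample data built from rows 1, 2 and 30 of the question.
df = pd.DataFrame({
    "name": ["a", "b", "p"],
    "sample": [
        "Category 1: qwe, asd\nCategory 2: sdf, erg",
        "Category 2: sdf, erg\nCategory 5: zxc, eru",
        "Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err",
    ],
})

cleaned = (
    df["sample"]
    # Turn line breaks into the same ', ' separator used within a line.
    .str.replace("\n", ", ", regex=False)
    # Non-greedy pattern so each 'Category <label>: ' title is removed separately.
    .str.replace(r"Category .*?: ", "", regex=True)
    .str.strip()
)

# One indicator column per distinct value, one row per original record.
dummies = cleaned.str.get_dummies(sep=", ")
result = pd.concat([df["name"], dummies], axis=1)
print(result)
```

Because `str.get_dummies` works on the delimited string directly, the `melt` step is not needed in this variant and each name keeps a single row.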
