Comparing the words in DataFrame rows against the keys of a dictionary

import pandas as pd

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort'],
})

_id column                              score
1a  und der in zu                       20
2b  Kompliziertereswort something       2
3c  Lehrerin in zu [Buch]               15
4d  Buch (Lehrerin) kompliziertereswort 5

2 answers

User

Answer 1 · edited 2024-05-14 04:18:30

First, convert the dictionary to a DataFrame:

d = {'und': 20, 'der': 10,'in':  40, 'zu':  10,'Kompliziertereswort': 2, 'Buch': 5,'Lehrerin': 5}
d_df = pd.DataFrame({'column':[k for k in d],'number':[d[k] for k in d]})
d_df

column          number
und                20
der                10
in                 40
zu                 10
Kompliziertereswort 2
Buch                5
Lehrerin            5
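
The same lookup table can be built a little more directly from the dictionary's items. This is a minimal sketch rather than part of the original answer; it reuses the names `d` and `d_df` from above.

```python
import pandas as pd

d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}

# Each (key, value) pair becomes one row: the word and its score.
d_df = pd.DataFrame(list(d.items()), columns=['column', 'number'])
```

Naming the word column `column` keeps it identical to the text column in test_df, which is what the merge in the next step joins on.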

Then use pandas' explode() to split the text into one word per row, and merge the result with d_df:

test_df2 = test_df.set_index(['_id']).apply(lambda x: x.str.split(' ').explode()).reset_index()
test_df2 = pd.merge(test_df2,d_df,on='column',how='left')

_id column              number
1a  und                 20.0
1a  der                 10.0
1a  in                  40.0
1a  zu                  10.0
2b  Kompliziertereswort 2.0
2b  something           NaN
3c  Lehrerin            5.0
3c  in                  40.0
3c  zu                  10.0
3c  [Buch]              NaN
4d  Buch                5.0
4d  (Lehrerin)          NaN
4d  kompliziertereswort NaN

Compute the mean for each _id:

row_means = test_df2.groupby('_id')['number'].agg(['mean']).reset_index()

_id mean
1a  20.000000
2b  2.000000
3c  18.333333
4d  5.000000

Now you can merge row_means back into the main DataFrame (test_df) to add the mean column:

pd.merge(test_df,row_means,on='_id',how='left')

_id column                              mean
1a  und der in zu                       20.000000
2b  Kompliziertereswort something       2.000000
3c  Lehrerin in zu [Buch]               18.333333
4d  Buch (Lehrerin) kompliziertereswort 5.000000
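
Note that row 3c comes out as 18.333333 rather than the 15 shown in the question, because `[Buch]` and `(Lehrerin)` are compared with their brackets attached and so find no match in the dictionary. The following is a hedged sketch of one way to handle that, assuming the only punctuation to remove is surrounding square brackets and parentheses. It also swaps the merge for `Series.map`, which is a substitution of mine and not the answer's method.

```python
import pandas as pd

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort'],
})
d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}

words = (
    test_df.set_index('_id')['column']
    .str.split()          # split each row on whitespace
    .explode()            # one word per row, still indexed by _id
    .str.strip('[]()')    # assumption: only these characters wrap words
)

# Words missing from d map to NaN, which mean() skips by default.
row_means = words.map(d).groupby(level=0).mean()

# Attach the per-row mean to the original frame.
result = test_df.merge(row_means.rename('mean'),
                       left_on='_id', right_index=True, how='left')
```

With the brackets stripped, 3c averages to 15 and 4d to 5, matching the scores in the question. The lowercase `kompliziertereswort` still goes unmatched because the lookup is case-sensitive.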

User

Answer 2 · edited 2024-05-14 04:18:30

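We can build a regular-expression pattern from the dictionary's keys and extract every match of that pattern from each row. We then map the matched strings to their scores in the dictionary d and take the mean over level 0 of the index, which gives the average per original row.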

pat = fr"\b({'|'.join(d)})\b"
# Series.mean(level=0) was removed in pandas 2.0; group on the outer index level instead
test_df['score'] = test_df['column'].str.extractall(pat)[0].map(d).groupby(level=0).mean()

Result

print(test_df)

  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0
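
For reference, here is a self-contained sketch of the regex approach. Two details are my additions and not part of the answer: the keys are passed through `re.escape` as a precaution in case a key ever contains a regex metacharacter, and the mean is taken with `groupby(level=0)`, which works on both older and current pandas versions.

```python
import re
import pandas as pd

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort'],
})
d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}

# \b anchors each key at word boundaries, so 'in' does not match
# inside 'something', while 'Buch' still matches inside '[Buch]'.
pat = r"\b(" + "|".join(re.escape(k) for k in d) + r")\b"

# extractall returns a MultiIndex of (original row, match number).
matches = test_df['column'].str.extractall(pat)[0]

# Map each matched word to its score, then average per original row.
test_df['score'] = matches.map(d).groupby(level=0).mean()
```

Because the word boundaries ignore the surrounding brackets and parentheses, this version gives 15 for row 3c without any separate cleaning step. A row with no matches at all would end up with NaN in `score`.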
