Conditional word frequency count in pandas

import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

  speaker                                             speech  words
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1

3 Answers

User

#1 · edited 2024-06-16 10:47:25

You can use the following vectorized approach:

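import re
import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good', 'right']

# Wrap the whole alternation in \b...\b so that every word is matched as a whole word.
# Joining with r'\b|\b' alone leaves the first word without a leading \b
# and the last word without a trailing \b.
pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'
df['total'] = df['speech'].str.count(pattern)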

This gives:

>>> df
  speaker                                             speech  total
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1
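
The role of the word boundaries can be checked on a small example. The sketch below uses a made-up sentence (not from the question) and compares a plain alternation against one wrapped in `\b` on both sides; `re.escape` is added as a precaution in case a word contains regex metacharacters.

```python
import re
import pandas as pd

wordlist = ['much', 'good', 'right']

# Hypothetical sentence, chosen so that substrings would be miscounted.
s = pd.Series(['so much goodness in a bright copyright notice'])

alternation = '|'.join(re.escape(w) for w in wordlist)
bounded = r'\b(?:' + alternation + r')\b'

loose_count = s.str.count(alternation)  # also matches inside longer words
strict_count = s.str.count(bounded)     # whole words only

print(loose_count.iloc[0], strict_count.iloc[0])
```

The plain alternation also counts `good` inside `goodness` and `right` inside `bright` and `copyright`, whereas the bounded pattern counts only the standalone `much`.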

User

#2 · edited 2024-06-16 10:47:25

import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good', 'right']

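# Split each speech on whitespace into a list of tokens
# (punctuation stays attached, e.g. 'afternoon.').
df["speech"] = df["speech"].str.split()
# One row per token, with the speaker repeated on each row.
df = df.explode("speech")
# Keep only tokens found in wordlist and count them per speaker.
counts = df[df.speech.isin(wordlist)].groupby("speaker").size()
print(counts)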
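
Two things to be aware of with this approach: `groupby(...).size()` only lists speakers that have at least one match, and `str.split()` leaves punctuation attached to the tokens. A possible variant, sketched under the assumption that you want zero counts kept and punctuation ignored (the extra `Dana` row is made up for illustration):

```python
import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair', 'Dana'],  # 'Dana' is a hypothetical extra row
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest',
                   'Nothing to add.']}
df = pd.DataFrame(data)
wordlist = ['much', 'good', 'right']

# Extract word tokens (this drops punctuation), then get one row per token.
tokens = df.set_index('speaker')['speech'].str.findall(r'\w+').explode()

# Sum the boolean matches per speaker; speakers without a match get 0.
counts = tokens.isin(wordlist).groupby(level=0).sum()
print(counts)
```

Because the matches are summed rather than filtered first, every speaker appears in the result, including those with a count of 0.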

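User

#3 · edited 2024-06-16 10:47:25

If you have a very large word list and a large DataFrame to search, this is a faster solution in terms of run time.

I suspect this is because it uses a dictionary, which takes O(N) to build and O(1) per lookup. Regex search is slower by comparison.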

import pandas as pd
from collections import Counter

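def occurrence_counter(target_string, search_list):
    # Count every whitespace-separated token once; each lookup is then O(1).
    counts = Counter(target_string.split())
    # Counter returns 0 for missing keys, so no membership check is needed.
    return sum(counts[word] for word in search_list)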

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good','right']

# Store the per-row totals in a new column.
df['total'] = df['speech'].apply(lambda x: occurrence_counter(x, wordlist))
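
The speed claim above is easy to check on your own data. The following is a rough sketch, not a benchmark: it repeats the three sample speeches (with punctuation removed, an assumption made so that both approaches tokenize identically), confirms that the `Counter` approach and a word-bounded regex agree, and then times both with `timeit`. The helper names `with_counter` and `with_regex` are made up for this example.

```python
import re
import timeit
from collections import Counter

import pandas as pd

def occurrence_counter(target_string, search_list):
    counts = Counter(target_string.split())
    return sum(counts[word] for word in search_list)

# Synthetic data: the sample speeches without punctuation, repeated 1000 times.
speeches = ['Thank you very much and good afternoon',
            'Let me clarify that because I want to make sure we have got everything right',
            'By now you should have some good rest'] * 1000
df = pd.DataFrame({'speech': speeches})
wordlist = ['much', 'good', 'right']
pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'

def with_counter():
    return df['speech'].apply(lambda x: occurrence_counter(x, wordlist))

def with_regex():
    return df['speech'].str.count(pattern)

# Both approaches should agree before their timings are compared.
agree = bool((with_counter() == with_regex()).all())
print('same result:', agree)

print('Counter:', timeit.timeit(with_counter, number=5))
print('regex:  ', timeit.timeit(with_regex, number=5))
```

The timings depend on the machine, the pandas version, and above all the size of the word list, so it is worth rerunning this with a list and DataFrame closer to your real workload.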
