Python Pandas NLTK: Show the frequency of common phrases (ngrams) from a text field in a dataframe using BigramCollocationFinder


I have the following sample of a tokenized dataframe:

No  category    problem_definition_stopwords
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

I successfully ran the code below to get the ngram phrases.
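A minimal sketch of that code (assuming the tokenized column problem_definition_stopwords and NLTK's BigramCollocationFinder, as also used in the answer below):

import nltk.collocations
from nltk.collocations import BigramCollocationFinder

# assumption: df['problem_definition_stopwords'] holds lists of tokens, as in the sample above
finder = BigramCollocationFinder.from_documents(df['problem_definition_stopwords'])
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder.apply_freq_filter(1)  # keep bigrams that appear at least once
result = finder.nbest(bigram_measures.pmi, 10)  # 10 bigrams with the highest PMI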

The top 10 results by PMI look like this:

[('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]

I want the results above in a dataframe with frequency counts showing how often each of these bigrams occurs.

Example of the desired output:

ngram                    frequency
'brewing', 'properly'    1
'galley', 'work'         1
'maker', 'brewing'       1
'properly', '2'          1
...                      ...

How can I do this in Python?


1 Answer

This should do it...

First, set up the dataset (or something like it):

import pandas as pd
from nltk.collocations import *
import nltk.collocations
from nltk import ngrams
from collections import Counter

s = pd.Series(
    [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee']
    ]
)

finder = BigramCollocationFinder.from_documents(s.values)
bigram_measures = nltk.collocations.BigramAssocMeasures()

# only bigrams that appear 1+ times
finder.apply_freq_filter(1) 

# return the 10 n-grams with the highest PMI
result = finder.nbest(bigram_measures.pmi, 10)

Recreate the full list of ngrams using nltk.ngrams:
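A minimal sketch, assuming the series s defined above and that ngram_list should hold every bigram in the corpus:

# flatten every tokenized document in s into its bigram tuples
ngram_list = [gram for row in s for gram in ngrams(row, 2)]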

Use collections.Counter to count how many times each ngram occurs across the whole corpus:

counts = Counter(ngram_list).most_common()  # list of (bigram, count) pairs, most frequent first

Build a dataframe that looks like what you want:

pd.DataFrame.from_records(counts, columns=['gram', 'count'])
                   gram  count
0            (420, 420)      2
1       (coffee, maker)      1
2      (maker, brewing)      1
3   (brewing, properly)      1
4         (properly, 2)      1
5              (2, 420)      1
6        (galley, work)      1
7         (work, table)      1
8        (table, stuck)      1
9        (cloth, stuck)      1
10      (stuck, coffee)      1

You can then filter it to show only the ngrams produced by the finder.nbest call:

df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])
df[df['gram'].isin(result)]
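Putting it together against the original dataframe (a sketch, assuming df['problem_definition_stopwords'] holds the token lists from the question and that the output columns should be named ngram and frequency):

import pandas as pd
from collections import Counter
from nltk import ngrams

# count every bigram across the tokenized column
all_bigrams = [gram for row in df['problem_definition_stopwords'] for gram in ngrams(row, 2)]
freq_df = pd.DataFrame(Counter(all_bigrams).most_common(), columns=['ngram', 'frequency'])

# keep only the bigrams returned by finder.nbest above
freq_df = freq_df[freq_df['ngram'].isin(result)]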
