Counting the most popular words and "two-word combinations" of Hebrew words in a pandas DataFrame

Posted 2024-04-20 04:27:47


I have a CSV data file with a column 'notes' that contains free-text answers in Hebrew.

I want to find the most popular words and the most popular "two-word combinations", count how many times they occur, and plot them in a bar chart.

My code so far:

import pandas as pd

# PYTHONIOENCODING="UTF-8" belongs in the shell environment, not in Python code
df = pd.read_csv('keep.csv', encoding='utf-8', usecols=['notes'])
words = df.notes.str.split(expand=True).stack().value_counts()

This produces a list of words with their counts, but it keeps all the Hebrew stop words and does not produce "two-word combination" frequencies. I also tried this code, but it is not what I want:

import nltk

top_N = 30
txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

How can I achieve this with nltk?
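On the stop-word issue mentioned above: NLTK's `stopwords` corpus does not ship a Hebrew list, so you have to supply your own set and filter tokens before counting. A minimal sketch (the stop-word set and the sample data here are placeholders, not real Hebrew stop words):

```python
from collections import Counter

# Placeholder stop-word set -- substitute a real Hebrew stop-word list here,
# since NLTK's stopwords corpus has no Hebrew entry.
stopwords = {"of"}

def count_words(texts, stopwords):
    """Count word frequencies across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w not in stopwords)
    return counts

notes = ["aa bb of aa", "bb of cc"]
print(count_words(notes, stopwords).most_common(2))  # [('aa', 2), ('bb', 2)]
```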


2 Answers

Use `nltk.bigrams`.

A solution counting bigrams across all values joined together:

df = pd.DataFrame({'notes':['aa bb cc','cc cc aa aa']})

top_N = 3
txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)

bigrm = list(nltk.bigrams(words))
print (bigrm)
[('aa', 'bb'), ('bb', 'cc'), ('cc', 'cc'), ('cc', 'cc'), ('cc', 'aa'), ('aa', 'aa')]

word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                columns=['Word', 'Frequency'])
print(rslt)
    Word  Frequency
0  cc cc          2
1  aa bb          1
2  bb cc          1
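The same bigram count can be reproduced with the standard library alone, which is handy for checking the result: pairing each token with its successor via `zip` is equivalent to `nltk.bigrams` (a sketch using the same sample data as above):

```python
from collections import Counter

# Bigrams without NLTK: zip each token with its successor.
txt = 'aa bb cc cc cc aa aa'
words = txt.split()
bigrams = [' '.join(pair) for pair in zip(words, words[1:])]
print(Counter(bigrams).most_common(3))  # [('cc cc', 2), ('aa bb', 1), ('bb cc', 1)]
```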

A solution counting bigrams within each value (row) of the column separately:

df = pd.DataFrame({'notes':['aa bb cc','cc cc aa aa']})

top_N = 3
f = lambda x: list(nltk.bigrams(nltk.tokenize.word_tokenize(x)))
b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
print (b)

word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
    Word  Frequency
0  aa bb          1
1  bb cc          1
2  cc cc          1
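The difference between the two solutions is that joining all rows into one string creates bigrams that span row boundaries, while the per-row version does not. A small standard-library sketch of that difference, on the same sample data:

```python
from collections import Counter

rows = ['aa bb cc', 'cc cc aa aa']

def row_bigrams(text):
    """Adjacent word pairs within a single text."""
    w = text.split()
    return [' '.join(p) for p in zip(w, w[1:])]

# Count per row, then compare with counting over the joined text.
per_row = Counter(bg for row in rows for bg in row_bigrams(row))
joined = Counter(row_bigrams(' '.join(rows)))
print(joined - per_row)  # Counter({'cc cc': 1}) -- the bigram spanning the two rows
```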

And if you need to count bigrams together with the individual words:

top_N = 3
f = lambda x: list(nltk.everygrams(nltk.tokenize.word_tokenize(x), 1, 2))
b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
print (b)

word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])

Finally, plot with `DataFrame.plot.bar`:

rslt.plot.bar(x='Word', y='Frequency')

In addition to what jezrael posted, I would like to show another way to achieve this. Since you want the single-word as well as the two-word frequencies, you can also make use of the `everygrams` function.

Given the DataFrame:

import pandas as pd

df = pd.DataFrame()
df['notes'] = ['this is sentence one', 'is sentence two this one', 'sentence one was good']

Use `everygrams(word_tokenize(x), 1, 2)` to get the one-word and two-word forms; to get one-, two-, and three-word combinations, change the 2 to 3, and so on. So in your case it should be:

from nltk import everygrams, word_tokenize

x = df['notes'].apply(lambda x: [' '.join(ng) for ng in everygrams(word_tokenize(x), 1, 2)]).to_frame()

At this point you should see:

                                               notes
0  [this, is, sentence, one, this is, is sentence...
1  [is, sentence, two, this, one, is sentence, se...
2  [sentence, one, was, good, sentence one, one w...
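For reference, the `everygrams(tokens, 1, 2)` call can be mimicked with the standard library: emit every token, then every adjacent pair (a sketch matching the unigrams-then-bigrams order shown above; the exact ordering NLTK produces may differ by version):

```python
# Stdlib sketch of 1- and 2-grams: all tokens, then all adjacent pairs.
def one_and_two_grams(text):
    w = text.split()
    return w + [' '.join(p) for p in zip(w, w[1:])]

print(one_and_two_grams('this is sentence one'))
# ['this', 'is', 'sentence', 'one', 'this is', 'is sentence', 'sentence one']
```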

You can now get the counts by flattening the list and using `value_counts`:

import numpy as np

flattenList = pd.Series(np.concatenate(x.notes))
freqDf = flattenList.value_counts().sort_index().rename_axis('notes').reset_index(name = 'frequency')

Final output:

           notes  frequency
0           good          1
1             is          2
2    is sentence          2
3            one          3
4        one was          1
5       sentence          3
6   sentence one          2
7   sentence two          1
8           this          2
9        this is          1
10      this one          1
11           two          1
12      two this          1
13           was          1
14      was good          1
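Note that the table above is sorted alphabetically (via `sort_index`); if you want the most frequent n-grams first, sort on the counts instead. A minimal stdlib sketch over a few of the counts above:

```python
# Rank n-grams by count, highest first (stable for ties).
counts = {'good': 1, 'is': 2, 'one': 3, 'sentence': 3}
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:2])  # [('one', 3), ('sentence', 3)]
```

With the DataFrame above, the equivalent is `freqDf.sort_values('frequency', ascending=False).head(top_N)`.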

Plotting the chart is now simple:

import matplotlib.pyplot as plt 

plt.figure()
flattenList.value_counts().plot(kind = 'bar', title = 'Count of 1-word and 2-word frequencies')
plt.xlabel('Words')
plt.ylabel('Count')
plt.show()

Output: a bar chart titled 'Count of 1-word and 2-word frequencies' (image omitted).
