Counting the most popular words and "two-word combinations" of Hebrew words in a pandas DataFrame

Posted 2024-04-20 04:27:47


I have a CSV data file with a column 'notes' that contains free-text answers in Hebrew.

I want to find the most popular words and the most popular "two-word combinations", count how many times they occur, and plot them in a bar chart.

My code so far:

import pandas as pd

# PYTHONIOENCODING="UTF-8" belongs in the shell environment, not in Python code
df = pd.read_csv('keep.csv', encoding='utf-8', usecols=['notes'])
words = df.notes.str.split(expand=True).stack().value_counts()

This produces a list of words with their counts, but it keeps all the Hebrew stop words and does not produce "two-word combination" frequencies. I also tried this code, but it is not what I want:

import nltk

top_N = 30
txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

How can I achieve this with nltk?
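On the stop-word issue mentioned above: NLTK's `stopwords` corpus does not ship a Hebrew list, so you have to supply your own set and filter tokens before counting. A minimal sketch (the stop-word set and the sample data here are placeholders, not real Hebrew stop words):

```python
from collections import Counter

# Placeholder stop-word set -- substitute a real Hebrew stop-word list here,
# since NLTK's stopwords corpus has no Hebrew entry.
stopwords = {"of"}

def count_words(texts, stopwords):
    """Count word frequencies across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w not in stopwords)
    return counts

notes = ["aa bb of aa", "bb of cc"]
print(count_words(notes, stopwords).most_common(2))  # [('aa', 2), ('bb', 2)]
```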


2 Answers

Use `nltk.bigrams`.

A solution counting bigrams across all values joined together:

df = pd.DataFrame({'notes':['aa bb cc','cc cc aa aa']})

top_N = 3
txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)

bigrm = list(nltk.bigrams(words))
print (bigrm)
[('aa', 'bb'), ('bb', 'cc'), ('cc', 'cc'), ('cc', 'cc'), ('cc', 'aa'), ('aa', 'aa')]

word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                columns=['Word', 'Frequency'])
print(rslt)
    Word  Frequency
0  cc cc          2
1  aa bb          1
2  bb cc          1
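The same bigram count can be reproduced with the standard library alone, which is handy for checking the result: pairing each token with its successor via `zip` is equivalent to `nltk.bigrams` (a sketch using the same sample data as above):

```python
from collections import Counter

# Bigrams without NLTK: zip each token with its successor.
txt = 'aa bb cc cc cc aa aa'
words = txt.split()
bigrams = [' '.join(pair) for pair in zip(words, words[1:])]
print(Counter(bigrams).most_common(3))  # [('cc cc', 2), ('aa bb', 1), ('bb cc', 1)]
```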

A solution counting bigrams within each value (row) of the column separately:

df = pd.DataFrame({'notes':['aa bb cc','cc cc aa aa']})

top_N = 3
f = lambda x: list(nltk.bigrams(nltk.tokenize.word_tokenize(x)))
b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
print (b)

word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
    Word  Frequency
0  aa bb          1
1  bb cc          1
2  cc cc          1
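The difference between the two solutions is that joining all rows into one string creates bigrams that span row boundaries, while the per-row version does not. A small standard-library sketch of that difference, on the same sample data:

```python
from collections import Counter

rows = ['aa bb cc', 'cc cc aa aa']

def row_bigrams(text):
    """Adjacent word pairs within a single text."""
    w = text.split()
    return [' '.join(p) for p in zip(w, w[1:])]

# Count per row, then compare with counting over the joined text.
per_row = Counter(bg for row in rows for bg in row_bigrams(row))
joined = Counter(row_bigrams(' '.join(rows)))
print(joined - per_row)  # Counter({'cc cc': 1}) -- the bigram spanning the two rows
```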

And if you need to count bigrams together with the individual words:

top_N = 3
f = lambda x: list(nltk.everygrams(nltk.tokenize.word_tokenize(x), 1, 2))
b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
print (b)

word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])

Finally, plot with `DataFrame.plot.bar`:

rslt.plot.bar(x='Word', y='Frequency')

In addition to what jezrael posted, I would like to show another way to achieve this. Since you want the single-word as well as the two-word frequencies, you can also make use of the `everygrams` function.

Given the DataFrame:

import pandas as pd

df = pd.DataFrame()
df['notes'] = ['this is sentence one', 'is sentence two this one', 'sentence one was good']

Use `everygrams(word_tokenize(x), 1, 2)` to get the one-word and two-word forms; to get one-, two-, and three-word combinations, change the 2 to 3, and so on. So in your case it should be:

from nltk import everygrams, word_tokenize

x = df['notes'].apply(lambda x: [' '.join(ng) for ng in everygrams(word_tokenize(x), 1, 2)]).to_frame()

At this point you should see:

                                               notes
0  [this, is, sentence, one, this is, is sentence...
1  [is, sentence, two, this, one, is sentence, se...
2  [sentence, one, was, good, sentence one, one w...
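For reference, the `everygrams(tokens, 1, 2)` call can be mimicked with the standard library: emit every token, then every adjacent pair (a sketch matching the unigrams-then-bigrams order shown above; the exact ordering NLTK produces may differ by version):

```python
# Stdlib sketch of 1- and 2-grams: all tokens, then all adjacent pairs.
def one_and_two_grams(text):
    w = text.split()
    return w + [' '.join(p) for p in zip(w, w[1:])]

print(one_and_two_grams('this is sentence one'))
# ['this', 'is', 'sentence', 'one', 'this is', 'is sentence', 'sentence one']
```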

You can now get the counts by flattening the list and using `value_counts`:

import numpy as np

flattenList = pd.Series(np.concatenate(x.notes))
freqDf = flattenList.value_counts().sort_index().rename_axis('notes').reset_index(name = 'frequency')

Final output:

           notes  frequency
0           good          1
1             is          2
2    is sentence          2
3            one          3
4        one was          1
5       sentence          3
6   sentence one          2
7   sentence two          1
8           this          2
9        this is          1
10      this one          1
11           two          1
12      two this          1
13           was          1
14      was good          1
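Note that the table above is sorted alphabetically (via `sort_index`); if you want the most frequent n-grams first, sort on the counts instead. A minimal stdlib sketch over a few of the counts above:

```python
# Rank n-grams by count, highest first (stable for ties).
counts = {'good': 1, 'is': 2, 'one': 3, 'sentence': 3}
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:2])  # [('one', 3), ('sentence', 3)]
```

With the DataFrame above, the equivalent is `freqDf.sort_values('frequency', ascending=False).head(top_N)`.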

Plotting the chart is now simple:

import matplotlib.pyplot as plt 

plt.figure()
flattenList.value_counts().plot(kind = 'bar', title = 'Count of 1-word and 2-word frequencies')
plt.xlabel('Words')
plt.ylabel('Count')
plt.show()

Output: a bar chart titled 'Count of 1-word and 2-word frequencies' (image omitted).
