Sorting words by frequency in descending order in Python
I am using this code to count how often each word occurs in a text file:
#!/usr/bin/python
file=open("out1.txt","r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k,v in wordcount.items():
    print k, v
How can I print the output sorted by frequency, from highest to lowest?
4 Answers
1
Use the Counter class from the collections module.
from collections import Counter
s = "This is a sentence this is a this is this"
c = Counter(s.split())
# s.split() returns a list of words; with no argument it splits on any whitespace
print c
>>> Counter({'is': 3, 'this': 3, 'a': 2, 'This': 1, 'sentence': 1})
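However, this can miscount when the text contains periods and capital letters. You can strip the trailing period from each word so that it is counted correctly, and convert everything to lowercase (or uppercase) so that the count is case-insensitive.
You can solve both problems like this: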
s1 = "This is a sentence. This is a. This is. This."
s2 = ""
for word in s1.split():
    # punctuation check; a regex would make this more robust
    if word.endswith('.') or word.endswith('!') or word.endswith('?'):
        s2 += word[:-1] + " "
    else:
        s2 += word + " "
c = Counter(s2.lower().split())
print c
>>> Counter({'this': 4, 'is': 3, 'a': 2, 'sentence': 1})
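As the comment in the loop suggests, a regular expression can handle the punctuation in one pass. Here is a minimal sketch of that idea, written in Python 3 syntax (unlike the Python 2 snippets above). The helper name count_words and the pattern [a-z']+ are my own assumptions, not part of the original answer; the pattern only covers ASCII letters and apostrophes, so adjust it for other alphabets.

```python
import re
from collections import Counter

def count_words(text):
    # Lowercase first so that "This" and "this" are counted together.
    # [a-z']+ matches runs of ASCII letters and apostrophes, so periods,
    # commas, and other punctuation are dropped wherever they appear.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

s1 = "This is a sentence. This is a. This is. This."
c = count_words(s1)
print(c.most_common())
# [('this', 4), ('is', 3), ('a', 2), ('sentence', 1)]
```

Unlike the endswith check, this also removes punctuation at the start of or inside a token, for example a leading quotation mark.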
2
You can build a list of tuples and then sort that list. Here is an example.
wordcount = {'cat':1,'dog':2,'kangaroo':20}
ls = [(k,v) for (k,v) in wordcount.items()]
ls.sort(key=lambda x:x[1],reverse=True)
for k,v in ls:
    print k, v
...output...
kangaroo 20
dog 2
cat 1
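The same result can be had without building the list by hand, since sorted accepts any iterable of pairs. A minimal sketch in Python 3 syntax, assuming the same example dictionary; the variable name ranked is only illustrative.

```python
from operator import itemgetter

wordcount = {'cat': 1, 'dog': 2, 'kangaroo': 20}

# itemgetter(1) picks the count out of each (word, count) pair;
# reverse=True puts the largest counts first.
ranked = sorted(wordcount.items(), key=itemgetter(1), reverse=True)

for word, count in ranked:
    print(word, count)
```

sorted returns a new list and leaves the dictionary untouched, which is handy if you still need the counts for lookups afterwards.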
7
Calling Counter.most_common with no argument returns a list sorted by frequency from highest to lowest.
from collections import Counter
word_count = Counter()
with open("out1.txt","r+") as file:
    word_count.update((word for word in file.read().split()))
for word, count in word_count.most_common():
    print word, count
>>> the 6
Lorem 4
of 4
and 3
Ipsum 3
text 2
type 2
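If you only need the top few words, most_common also takes an optional count n. A minimal sketch in Python 3 syntax; the sample string is made up for illustration and stands in for the contents of out1.txt.

```python
from collections import Counter

# Hypothetical sample text standing in for the file contents.
text = "the cat and the dog and the bird"
word_count = Counter(text.split())

# most_common(2) returns only the two most frequent (word, count) pairs.
top_two = word_count.most_common(2)
for word, count in top_two:
    print(word, count)
```

Words with equal counts are returned in the order they were first seen, so the cut-off between tied words depends on the input order.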
2
Here is the code:
file=open("out1.txt","r+")
wordcount={}
for word in file.read().split():
    word = word.lower()
    # keep only purely alphabetic words
    if word.isalpha():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
copy = []
for k,v in wordcount.items():
    copy.append((v, k))
# sorting (count, word) tuples in reverse puts the highest counts first
copy = sorted(copy, reverse=True)
for k in copy:
    print '%s: %d' %(k[1], k[0])
out1.txt:
hello there I am saying hello world because Bob is here and I am saying hello because John is here
The result is:
hello: 3
saying: 2
is: 2
i: 2
here: 2
because: 2
am: 2
world: 1
there: 1
john: 1
bob: 1
and: 1
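With the reverse sort of (count, word) tuples used above, words with equal counts come out in reverse alphabetical order. If you would rather list ties alphabetically, one option is to sort on the key (-count, word). A minimal sketch in Python 3 syntax, assuming the same sample line as input; the names line and ranked are only illustrative.

```python
from collections import Counter

line = ("hello there I am saying hello world because Bob is here "
        "and I am saying hello because John is here")
wordcount = Counter(line.lower().split())

# Negating the count sorts high counts first, while the word itself
# breaks ties in ascending alphabetical order.
ranked = sorted(wordcount.items(), key=lambda item: (-item[1], item[0]))

for word, count in ranked:
    print('%s: %d' % (word, count))
```

This prints hello first, then the two-count words from am to saying, then the one-count words from and to world.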