Letter frequency analysis in Python
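I need to write a program that prints the frequency of each letter in a text file and compares it with the frequencies from another file.
So far I can print the number of times each letter occurs, but the percentage frequencies I compute are wrong. I think the problem is that the program should count only the letters in the file, that is, it needs to ignore all spaces and other characters.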
def addLetter(x):
    # map a lowercase letter to its 0-based alphabet index ('a' -> 0)
    result = ord(x) - ord('a')
    return result

# start of the main program
# prompt user for a file
while True:
    speech = raw_input("Enter file name:")
    wholeFile = open(speech, 'r+').read()
    lowlet = wholeFile.lower()
    letters = list(lowlet)
    alpha = list('abcdefghijklmnopqrstuvwxyz')
    n = len(letters)
    f = float(n)
    occurrences = {}
    d = {}
    # number of letters
    for x in alpha:
        occurrences[x] = letters.count(x)
        d[x] = (occurrences[x]) / f
    for x in occurrences:
        print x, occurrences[x], d[x]
Here is the output:
Enter file name:dems.txt
a 993 0.0687863674148
c 350 0.0242449431976
b 174 0.0120532003325
e 1406 0.0973954003879
d 430 0.0297866444999
g 219 0.015170407315
f 212 0.0146855084511
i 754 0.0522305347742
h 594 0.0411471321696
k 81 0.00561097256858
j 12 0.000831255195345
m 273 0.0189110556941
l 442 0.0306178996952
o 885 0.0613050706567
n 810 0.0561097256858
q 9 0.000623441396509
p 215 0.0148933222499
s 672 0.0465502909393
r 637 0.0441257966196
u 305 0.021127736215
t 1175 0.0813937378775
w 334 0.0231366029371
v 104 0.00720421169299
y 212 0.0146855084511
x 13 0.000900526461624
z 6 0.000415627597672
Enter file name:
The program does print in columns, but I'm not sure how to show that here.
The frequency of the letter "a" should be 0.0878.
3 Answers
0
I think this is a very simple and direct approach:
while True:
    speech = raw_input("Enter file name:")
    wholeFile = open(speech, 'r+').read()
    lowlet = wholeFile.lower()
    alphas = 'abcdefghijklmnopqrstuvwxyz'
    # set default values first
    occurrences = {letter: 0 for letter in alphas}
    # occurrences = dict(zip(alphas, [0]*len(alphas)))  # for python<=2.6
    # total number of valid letters
    total = 0
    # iterate over every character in the text
    for letter in lowlet:
        # if it is a valid letter then it is in occurrences
        if letter in occurrences:
            # update counts
            total += 1
            occurrences[letter] += 1
    # now print the results:
    for letter, count in occurrences.iteritems():
        print letter, (1.0*count/total)
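As you noticed, you need the total number of valid letters in the text before you can compute each letter's frequency. You can either filter out the unwanted characters before processing the text, or combine filtering and counting in a single pass, which is what I did here.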
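For readers on Python 3, the same single-pass idea might look roughly like the sketch below. The function name letter_frequencies and the zero-letter guard are my own additions for illustration, not part of the answer above.

```python
import string


def letter_frequencies(text):
    """Return (counts, freqs) for the ASCII letters a-z found in text.

    Hypothetical helper: only letters are counted, so the denominator
    is the number of letters rather than the number of characters.
    """
    counts = {letter: 0 for letter in string.ascii_lowercase}
    total = 0
    for ch in text.lower():
        if ch in counts:          # skips spaces, digits, punctuation
            counts[ch] += 1
            total += 1
    # Guard against input with no letters to avoid dividing by zero.
    freqs = {k: (v / total if total else 0.0) for k, v in counts.items()}
    return counts, freqs


if __name__ == "__main__":
    counts, freqs = letter_frequencies("Hello, World!")
    for letter in sorted(counts):
        if counts[letter]:
            print(letter, counts[letter], round(freqs[letter], 4))
```

As a sanity check against the output posted in the question: the 26 letter counts there sum to 11317, so "a" comes out at 993 / 11317 ≈ 0.0877, in line with the expected value. The posted 0.0688 corresponds to dividing by about 14436, which is every character in the file.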
3
from __future__ import division  # must come before any other statement
import collections
import re

# keep ASCII letters only (\W alone would let digits and "_" through)
file1 = re.sub(r"[^A-Za-z]", "", open("file1.txt", "r").read()).lower()
counter1 = collections.Counter(file1)
for k, v in counter1.iteritems():
    counter1[k] = v / len(file1)

file2 = re.sub(r"[^A-Za-z]", "", open("file2.txt", "r").read()).lower()
counter2 = collections.Counter(file2)
for k, v in counter2.iteritems():
    counter2[k] = v / len(file2)
Note: this requires Python 2.7.
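The question also asks for the two files to be compared, and that step is not shown above. One possible way to do it, sketched in Python 3, is below. The helper names relative_frequencies and compare are made up for this example, and the per-letter difference is only one of several reasonable comparison measures.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each ASCII letter in text to its share of all letters."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    total = len(letters)
    if total == 0:
        return {}
    return {k: v / total for k, v in Counter(letters).items()}


def compare(freq1, freq2):
    """Return {letter: freq1 - freq2} for letters seen in either input."""
    keys = sorted(set(freq1) | set(freq2))
    return {k: freq1.get(k, 0.0) - freq2.get(k, 0.0) for k in keys}


if __name__ == "__main__":
    # In real use the two strings would come from open(...).read().
    f1 = relative_frequencies("aab")
    f2 = relative_frequencies("abb")
    for letter, diff in compare(f1, f2).items():
        print(letter, round(diff, 4))
```

A positive difference means the letter is relatively more common in the first text, a negative one that it is more common in the second.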
3
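You can use a translator to remove every character that is not in alpha. After that, letters contains only characters from alpha, so n becomes the correct denominator.
Then you can use collections.defaultdict(int) to count the occurrences of each letter: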
import collections
import string

def translator(frm='', to='', delete='', keep=None):
    # Python Cookbook Recipe 1.9
    # Chris Perkins, Raymond Hettinger
    if len(to) == 1: to = to * len(frm)
    trans = string.maketrans(frm, to)
    if keep is not None:
        allchars = string.maketrans('', '')
        # delete is expanded to delete everything except
        # what is mentioned in set(keep)-set(delete)
        delete = allchars.translate(allchars, keep.translate(allchars, delete))
    def translate(s):
        return s.translate(trans, delete)
    return translate

alpha = 'abcdefghijklmnopqrstuvwxyz'
keep_alpha = translator(keep=alpha)

while True:
    speech = raw_input("Enter file name:")
    wholeFile = open(speech, 'r+').read()
    lowlet = wholeFile.lower()
    letters = keep_alpha(lowlet)
    n = len(letters)
    occurrences = collections.defaultdict(int)
    for x in letters:
        occurrences[x] += 1
    for x in occurrences:
        print x, occurrences[x], occurrences[x]/float(n)
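The recipe above relies on string.maketrans, which is Python 2 only. On Python 3 a similar "keep only these characters" effect can be had with str.translate and a table that returns None for anything not listed. The sketch below is one way to do that. KeepOnly and count_letters are names invented for this example and are not part of the Cookbook recipe.

```python
import string
from collections import defaultdict


class KeepOnly(dict):
    """Translation table for str.translate that drops unlisted characters."""

    def __init__(self, keep):
        # Map each kept character's code point to itself.
        super().__init__((ord(ch), ch) for ch in keep)

    def __missing__(self, key):
        # Returning None tells str.translate to delete the character.
        return None


def count_letters(text, keep=string.ascii_lowercase):
    """Return (letters, occurrences) after filtering text down to keep."""
    letters = text.lower().translate(KeepOnly(keep))
    occurrences = defaultdict(int)
    for ch in letters:
        occurrences[ch] += 1
    return letters, occurrences


if __name__ == "__main__":
    letters, occurrences = count_letters("Hello, World! 123")
    n = len(letters)  # number of letters only, the right denominator
    for ch in sorted(occurrences):
        print(ch, occurrences[ch], occurrences[ch] / n)
```

A plain generator such as ''.join(ch for ch in text if ch in keep) would filter just as well. The table version is shown only because it mirrors the translator approach most closely.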