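I am trying to write a script that reads the text in a file, counts the total number of words and the number of distinct words, lists the 10 most frequent words with their counts, and sorts the character frequencies from most to least frequent.
Here is what I have so far: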
import sys
import os
os.getcwd()
import string
path = ""
os.chdir(path)
#Prompt for user to input filename:
fname = input('Enter the filename: ')
try:
    fhand = open(fname)
except IOError:
    #Invalid filename error
    print('\n')
    print("Sorry, file can't be opened! Please check your spelling.")
    sys.exit()
#Initialize char counts and word counts dictionary
counts = {}
worddict = {}
#For character and word frequency count
for line in fhand:
    #Remove leading spaces
    line = line.strip()
    #Convert everything in the string to lowercase
    line = line.lower()
    #Take into account punctuation
    line = line.translate(line.maketrans('', '', string.punctuation))
    #Take into account white spaces
    line = line.translate(line.maketrans('', '', string.whitespace))
    #Take into account digits
    line = line.translate(line.maketrans('', '', string.digits))
    #Splitting line into words
    words = line.split(" ")
    for word in words:
        #Is the word already in the word dictionary?
        if word in worddict:
            #Increase by 1
            worddict[word] += 1
        else:
            #Add word to dictionary with count of 1 if not there already
            worddict[word] = 1
    #Character count
    for word in line:
        #Increase count by 1 if letter
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
#Initialize dictionaries
lst = []
countlst = []
freqlst = []
#Count up the number of letters
for ltrs, c in counts.items():
    lst.append((c,ltrs))
    countlst.append(c)
#Sum up the count
totalcount = sum(countlst)
#Calculate the frequency in each dictionary
for ec in countlst:
    efreq = (ec/totalcount) * 100
    freqlst.append(efreq)
#Sort lists by count and percentage frequency
freqlst.sort(reverse=True)
lst.sort(reverse=True)
#Print out word counts
for key in list(worddict.keys()):
    print(key, ":", worddict[key])
#Print out all letters and counts:
for ltrs, c, in lst:
    print(c, '-', ltrs, '-', round(ltrs/totalcount*100, 2), '%')
When I run the script on something like romeo.txt:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
I get this output:
butsoftwhatlightthroughyonderwindowbreaks : 1
itistheeastandjulietisthesun : 1
arisefairsunandkilltheenviousmoon : 1
whoisalreadysickandpalewithgrief : 1
i - 14 - 10.45 %
t - 12 - 8.96 %
e - 12 - 8.96 %
s - 11 - 8.21 %
a - 11 - 8.21 %
n - 9 - 6.72 %
h - 9 - 6.72 %
o - 8 - 5.97 %
r - 7 - 5.22 %
u - 6 - 4.48 %
l - 6 - 4.48 %
d - 6 - 4.48 %
w - 5 - 3.73 %
k - 3 - 2.24 %
g - 3 - 2.24 %
f - 3 - 2.24 %
y - 2 - 1.49 %
b - 2 - 1.49 %
v - 1 - 0.75 %
p - 1 - 0.75 %
m - 1 - 0.75 %
j - 1 - 0.75 %
c - 1 - 0.75 %
When I run the script on frequency.txt:
I am you you you you you I I I I you you you you I am
I get this output:
iamyouyouyouyouyouiiiiyouyouyouyouiam : 1
y - 9 - 24.32 %
u - 9 - 24.32 %
o - 9 - 24.32 %
i - 6 - 16.22 %
m - 2 - 5.41 %
a - 2 - 5.41 %
Could I get some guidance on how to separate the words on each line and tally the counts in the desired way?
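You are removing all whitespace from the line with the call line.translate(line.maketrans('', '', string.whitespace)) (the step commented "Take into account white spaces"). Once the spaces are gone, line.split(" ") has nothing to split on, so each whole line is counted as a single word. Remove that call and the word counts will work as you intend.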
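As a minimal sketch of that fix, the cleaning step below keeps the spaces and strips only punctuation and digits. The helper name clean_line and the hard-coded sample line (taken from frequency.txt in the question) are assumptions for illustration, not part of the original script.

```python
import string

def clean_line(line):
    """Lowercase a line and drop punctuation and digits, but keep the spaces."""
    line = line.strip().lower()
    line = line.translate(str.maketrans('', '', string.punctuation))
    line = line.translate(str.maketrans('', '', string.digits))
    return line

# Hypothetical stand-in for the lines read from the file
lines = ["I am you you you you you I I I I you you you you I am"]

worddict = {}
for line in lines:
    # split() with no argument splits on any run of whitespace
    for word in clean_line(line).split():
        worddict[word] = worddict.get(word, 0) + 1

print(worddict)  # {'i': 6, 'am': 2, 'you': 9}
```

Note that with the spaces kept, the character-count loop in the question would also count the space character, so it would need to skip it (for example with a check such as ch != ' ').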
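Your code removes the spaces and then tries to split on spaces, which cannot work. Since you want to extract every word from the given text, I suggest normalizing the text so that all words sit next to each other with exactly one space in between. That means removing not only newlines, redundant spaces, special or unwanted characters, and digits, but also control characters.
This should do the trick: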
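The code that accompanied this answer is not present here, so what follows is only a sketch of the approach it describes, not the answerer's own code. It assumes words consist of the ASCII letters a-z only (so "can't" would become "can" and "t"), and the names normalize and summarize are hypothetical. It uses re.sub to turn everything else into spaces and collections.Counter to tally words and characters.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase the text and reduce it to letter-only words joined by single spaces."""
    # Assumption: anything that is not a-z (digits, punctuation, newlines,
    # control characters) is a separator. split() then collapses the gaps.
    return " ".join(re.sub(r"[^a-z]+", " ", text.lower()).split())

def summarize(text):
    """Return word totals, the 10 most common words, and sorted character counts."""
    clean = normalize(text)
    words = clean.split()
    word_counts = Counter(words)
    # Count letters only, ignoring the single spaces between words
    char_counts = Counter(clean.replace(" ", ""))
    return {
        "total_words": len(words),
        "distinct_words": len(word_counts),
        "top_words": word_counts.most_common(10),
        "char_counts": char_counts.most_common(),
    }

# Sample text taken from frequency.txt in the question
sample = "I am you you you you you I I I I you you you you I am"
result = summarize(sample)

print(result["total_words"], result["distinct_words"])  # 17 3
print(result["top_words"])  # [('you', 9), ('i', 6), ('am', 2)]

total_chars = sum(count for _, count in result["char_counts"])
for ch, count in result["char_counts"]:
    print(ch, '-', count, '-', round(count / total_chars * 100, 2), '%')
```

To run it on a file instead of the sample string, the whole file can be read at once (for example with open(fname).read()) and passed to summarize, since normalize already handles the newlines.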