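I'm writing a script that finds all the text files in a directory, then finds the number of lines in each file and its most common word. I know this isn't the simplest or tidiest approach, but I'm still very new to Python (two weeks).
One small problem I've run into is that I have two main dictionaries. One stores each file and its line count; the other stores each file, its line count, and the most common word with its frequency, like this: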
dict1_example = {'file': 'lines'}
dict2_example = {'file': ('lines', ('word', 'count'))}
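I'd like to be able to pull out the most frequent word across all files, i.e. access the (word, count) part of the second dictionary.
Is there a way to get the information from just that part, or do I need to use the function to create an extra dictionary?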
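To make the layout concrete, here is a minimal sketch of what the second dictionary looks like once the script has filled it, and how the (word, count) part can be reached by plain indexing. The file names and numbers are made up for illustration, and the variable names `best_file`, `best_word` and `best_count` are hypothetical; the snippet is written so it should run on both Python 2.7 and Python 3.

```python
# Hypothetical sample data: each value is (line_count, (word, count)),
# the same shape the script builds with `lines, word_count(file)`.
file_info = {
    'a.txt': (12, ('PYTHON', 5)),
    'b.txt': (40, ('DICT', 9)),
}

# Index 0 is the line count, index 1 is the (word, count) pair,
# so nested unpacking splits one entry into its three pieces.
lines, (word, count) = file_info['b.txt']

# Find the file whose top word has the highest count:
# file_info[name][1] is the (word, count) pair, [1][1] is the count.
best_file = max(file_info, key=lambda name: file_info[name][1][1])
best_word, best_count = file_info[best_file][1]

print('{}: {} ({} times)'.format(best_file, best_word, best_count))
```

With this approach no extra dictionary is needed, at the cost of slightly cryptic double indexing.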
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob
import os
from sys import argv
import re
from collections import Counter
script, directory = argv
def file_len2(filename2):
    # Count the non-blank lines in the file
    with open(filename2) as f2:
        l2 = [x for x in f2.readlines() if x != "\n"]
    return len(l2)

def word_count(filename3):
    # Return the most common non-stop word in the file and its count
    with open(filename3) as f3:
        passage = f3.read()
    stop_words = ("THE", "OF", "A", "TO", "AND", "IS", "IN", "YOU", "THAT", "IT", "THIS", "YOUR", "AS", "AN", "BUT", "FOR")
    words = re.findall(r'\w+', passage)
    cap_words = [word.upper() for word in words if word.upper() not in stop_words]
    word_counts = Counter(cap_words)
    # Look up the top word once instead of calling max() twice
    top_word = max(word_counts, key=word_counts.get)
    return top_word, word_counts[top_word]
files = glob.glob(directory + "/*.txt")
length = {}
file_info = {}
for file in files:
    lines = file_len2(file)
    length[file] = lines
    file_info[file] = lines, word_count(file)
for file, lines in length.iteritems():
    print '{}: {}'.format(os.path.basename(file), lines), word_count(file)
maximum_file = max(length, key=length.get)
minimum_file = min(length, key=length.get)
maximum_lines = os.path.basename(max(length, key=length.get))
minimum_lines = os.path.basename(min(length, key=length.get))
print "The file with the maximum number of lines:"
print "%r lines in %r " % (length[maximum_file], maximum_lines)
print "The file with the minimum number of lines:"
print "%r lines in %r" % (length[minimum_file], minimum_lines)
sum_lines = sum(length.values())
number_of_values = len(length)
average = sum_lines / number_of_values
print "The average number of lines in a text file in given directory: ", average, "- Rounded down"
It looks like making another dict solved my problem:
By switching
Then I used this to get the most common word:
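A minimal sketch of the extra-dict idea, assuming `file_info` has the (lines, (word, count)) shape built by the script in the question. The names `top_words` and `top_word_overall` and the sample data are hypothetical, and this is one possible way to do it rather than the exact code originally posted. It should run on both Python 2.7 and Python 3.

```python
def top_word_overall(top_words):
    """Return (file, word, count) for the entry with the highest count.

    `top_words` maps a file name to its (word, count) pair.
    """
    best_file = max(top_words, key=lambda name: top_words[name][1])
    word, count = top_words[best_file]
    return best_file, word, count


# Hypothetical data in the same shape as the question's file_info dict.
file_info = {
    'a.txt': (12, ('PYTHON', 5)),
    'b.txt': (40, ('DICT', 9)),
    'c.txt': (7, ('PYTHON', 7)),
}

# Build the extra dict: keep only the (word, count) part of each value.
top_words = {name: info[1] for name, info in file_info.items()}

result = top_word_overall(top_words)
print('{}: {} ({} times)'.format(*result))
```

Keeping the word data in its own dict means `max` only has to look one level deep, which mirrors how the `length` dict is already used for the line counts.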