Create a Python dictionary from a text file and retrieve the count of each word

User

Answer 1 · edited 2024-05-20 22:49:01

This sounds like a job for collections.Counter:

import collections

with open('gettysburg.txt') as f:
    c = collections.Counter(f.read().split())

print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)

Result:

$ python foo.py 
'Four' appears 1 times
'the' appears 9 times
There are 267 total words
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]

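Of course, this counts "liberty," and "this." as words (note the punctuation inside the words). It also treats "The" and "the" as different words. Finally, reading the whole file at once can be costly for very large files.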

Here is a version that ignores punctuation and case, and uses less memory on large files because it reads one line at a time.

import collections
import re

with open('gettysburg.txt') as f:
    c = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line))

print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)

Result:

$ python foo.py 
'Four' appears 0 times
'the' appears 11 times
There are 271 total words
The 5 most common words are [('that', 13), ('the', 11), ('we', 10), ('to', 8), ('here', 8)]
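
For anyone on Python 3, here is a minimal sketch of the same idea. The helper name count_words and the sample text are my own assumptions, not part of the answer above; io.StringIO just stands in for the open file so the snippet runs without gettysburg.txt.

```python
import collections
import io
import re

# Same pattern as above: runs of letters only (no digits, no underscores).
WORD_RE = re.compile(r'\b[^\W\d_]+\b')


def count_words(lines):
    """Count lowercased words from any iterable of text lines (hypothetical helper)."""
    return collections.Counter(
        word.lower()
        for line in lines
        for word in WORD_RE.findall(line))


# In-memory stand-in for the file; iterating it yields one line at a time.
sample = io.StringIO("Four score and seven years ago\nThe people, the nation.\n")
c = count_words(sample)

print("'the' appears %d times" % c['the'])
print("There are %d total words" % sum(c.values()))
```

Passing an open file object instead of the StringIO should work the same way, since both yield lines when iterated.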

User

Answer 2 · edited 2024-05-20 22:49:01

A few points:

In Python, always read files with the following construct:

with open('ls;df', 'r') as f:
    # rest of the statements

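Second, if you use f.read().split(), it reads all the way to the end of the file. To read the file again after that, you need to go back to the beginning: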

f.seek(0)
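
A small sketch of why the rewind matters, assuming Python 3 and using io.StringIO as a stand-in for the open file (the sample text is made up for illustration):

```python
import io

# io.StringIO stands in for an open text file in this sketch.
f = io.StringIO("four score and seven years ago")

first = f.read().split()   # consumes the whole stream
second = f.read().split()  # nothing left to read, so this is an empty list
f.seek(0)                  # rewind to the start
third = f.read().split()   # the full content again

print(len(first), len(second), len(third))
```

Without the seek(0), any second pass over the file silently sees no words at all.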

Third, the part where you do:

for w in words: 
    if i < count: 
        words[i].translate(None, string.punctuation).lower() 
        i += 1 
    else: 
        i += 1 
        print words

You don't need to maintain a counter by hand in Python. You can simply do...

for i, w in enumerate(words): 
    if i < count: 
        words[i].translate(None, string.punctuation).lower() 
    else: 
        print words

However, you don't even need the i < count check here... you can simply do:

words = [w.translate(None, string.punctuation).lower() for w in words]
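
Note that the two-argument translate(None, ...) form only works on Python 2 str. On Python 3 a rough equivalent, as far as I can tell, is to build a deletion table with str.maketrans. The name strip_punct and the sample words are placeholders of mine:

```python
import string

# Translation table that deletes every ASCII punctuation character.
strip_punct = str.maketrans('', '', string.punctuation)

words = ["Liberty,", "this.", "The", "can't"]  # sample data, not from the question
words = [w.translate(strip_punct).lower() for w in words]
print(words)
```

Either way, the result has to be assigned back, because strings are immutable and translate() returns a new string.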

Finally, if you only want to count states and don't want to build a whole dictionary of items, consider using filter...

print len(filter( lambda m: m == 'states', words ))
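
On Python 3, filter returns a lazy iterator, so len() on it raises a TypeError. Two possible substitutes are sketched below; the variable names and sample list are hypothetical:

```python
words = ["united", "states", "of", "states"]  # sample data for illustration

# Count the items that survive the filter without building a list.
n_filter = sum(1 for _ in filter(lambda m: m == 'states', words))

# For an exact match on a list, list.count does the same job directly.
n_count = words.count('states')

print(n_filter, n_count)
```

If words is already a list, words.count('states') is probably the simplest option.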

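One last thing...

If the file is large, it is not advisable to hold every word in memory at once. Consider updating the wc dictionary line by line. Instead of what you did, you could consider: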

for line in f: 
    words = line.split()
    # rest of your code
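
To make the "rest of your code" part concrete, here is one way the per-line update of wc might look on Python 3. This is only a sketch: io.StringIO stands in for the open file f, and the sample text is invented.

```python
import io

wc = {}  # word -> count, updated one line at a time

# io.StringIO stands in for the open file f in this sketch.
f = io.StringIO("we can not dedicate\nwe can not consecrate\n")

for line in f:
    for word in line.split():
        # dict.get supplies 0 the first time a word is seen
        wc[word] = wc.get(word, 0) + 1

print(wc['we'], wc.get('states', 0))
```

Only one line is held in memory at a time, so the footprint depends on the number of distinct words rather than on the file size.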

User

Answer 3 · edited 2024-05-20 22:49:01

File_Name = 'file.txt'
counterDict={}

with open(File_Name,'r') as fh:
    for line in fh:
        # strip periods, apostrophes and commas, then lowercase and split
        words = line.replace('.','').replace('\'','').replace(',','').lower().split()
        for word in words:
            if word not in counterDict:
                counterDict[word] = 1
            else:
                counterDict[word] = counterDict[word] + 1

print('Count of the word > common< :: ',  counterDict.get('common',0))
