<p>This sounds like a job for <code>collections.Counter</code>:</p>
<pre><code>import collections
with open('gettysburg.txt') as f:
    c = collections.Counter(f.read().split())
print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)
</code></pre>
<p>Result:</p>
<pre class="lang-none prettyprint-override"><code>$ python foo.py
'Four' appears 1 times
'the' appears 9 times
There are 267 total words
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]
</code></pre>
<hr/>
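<p>Of course, this counts "liberty," and "this." as words (note the punctuation attached to them). It also treats "The" and "the" as different words. Finally, reading the whole file into memory at once can be wasteful for very large files.</p>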
<p>Here is a version that ignores punctuation and case, and uses less memory on large files because it reads one line at a time.</p>
<pre><code>import collections
import re
with open('gettysburg.txt') as f:
    c = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line))
print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)
</code></pre>
<p>Result:</p>
<pre class="lang-none prettyprint-override"><code>$ python foo.py
'Four' appears 0 times
'the' appears 11 times
There are 271 total words
The 5 most common words are [('that', 13), ('the', 11), ('we', 10), ('to', 8), ('here', 8)]
</code></pre>
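<p>To make the difference between the two tokenizations concrete, here is a minimal sketch that runs both on one short string. The <code>sample</code> sentence is made up for illustration and is not taken from <code>gettysburg.txt</code>; the regular expression is the one used above.</p>

```python
import collections
import re

# Hypothetical sample text, chosen only to show punctuation, case, and a digit.
sample = "The cat saw the dog. The dog, startled, ran 2 miles!"

# Whitespace splitting keeps punctuation attached and preserves case.
naive = collections.Counter(sample.split())

# [^\W\d_] matches a word character that is neither a digit nor an
# underscore, i.e. a letter; \b anchors the match at word boundaries.
words = re.findall(r'\b[^\W\d_]+\b', sample)
normalized = collections.Counter(word.lower() for word in words)

print(naive['the'])        # 1  ("The" is counted separately)
print(naive['dog'])        # 0  (only "dog." and "dog," were seen)
print(normalized['the'])   # 3
print(normalized['dog'])   # 2
```

<p>Because a <code>Counter</code> returns 0 for a missing key, looking up <code>'Four'</code> after lowercasing gives 0, as in the second result above; look up <code>'four'</code> instead.</p>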
<p>References:</p>
<ul>
<li><a href="https://docs.python.org/2/library/re.html" rel="nofollow noreferrer">https://docs.python.org/2/library/re.html</a></li>
<li><a href="https://docs.python.org/2/library/collections.html#collections.Counter" rel="nofollow noreferrer">https://docs.python.org/2/library/collections.html#collections.Counter</a></li>
<li><a href="https://stackoverflow.com/questions/5717886/extracting-whole-words">Extracting whole words</a></li>
</ul>