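<p>Stream the file: read it line by line and process each line as it is read.</p>
<p>If the memory needed to hold the tokens is a concern, write the processed tokens out line by line or in batches.</p>
<p><strong>Line by line:</strong></p>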
<pre><code>from __future__ import print_function
from nltk import word_tokenize

with open('input.txt', 'r') as fin, open('output.txt', 'w') as fout:
    for line in fin:
        # Tokenize one line and write it out immediately,
        # so only a single line is held in memory at a time.
        tokenized_line = ' '.join(word_tokenize(line.strip()))
        print(tokenized_line, end='\n', file=fout)
</code></pre>
<p><strong>In batches (of 1000):</strong></p>
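<p>The batched variant might look something like the sketch below. It is not the original answer's code: the helper name <code>tokenize_in_batches</code> and its parameters are hypothetical, and the tokenizer is passed in as an argument so that the batching logic does not depend on NLTK being installed. In the setting above, <code>tokenize</code> would be <code>nltk.word_tokenize</code>.</p>

```python
from itertools import islice


def tokenize_in_batches(fin, fout, tokenize, batch_size=1000):
    """Tokenize fin in chunks of batch_size lines, writing each chunk at once.

    Hypothetical helper for illustration: fin and fout are open text
    file objects, tokenize is any callable mapping a string to a list
    of tokens (for example nltk.word_tokenize).
    Returns the number of lines processed.
    """
    total = 0
    while True:
        # Pull at most batch_size lines; the rest of the file stays unread.
        batch = list(islice(fin, batch_size))
        if not batch:
            break
        tokenized = [' '.join(tokenize(line.strip())) for line in batch]
        # One write per batch instead of one write per line.
        fout.write('\n'.join(tokenized) + '\n')
        total += len(batch)
    return total


# Usage, assuming NLTK and its tokenizer data are installed:
#
#   from nltk import word_tokenize
#   with open('input.txt', 'r') as fin, open('output.txt', 'w') as fout:
#       tokenize_in_batches(fin, fout, word_tokenize, batch_size=1000)
```

<p>Only one batch of lines is in memory at any time, so <code>batch_size</code> trades memory use against the number of write calls.</p>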