Reading a text file and searching for specific words from a keyword list
I have just started learning Python and want to write a script that processes a text file (text_file_1) containing a passage of text. The script should read the text and look for keywords that I have defined beforehand in a list called (key_words). The keywords include capitalized words (such as Nation) and lowercase words (such as nation). Once Python has finished the search, it should write the words it found, one per line, to a new text file named "List of Words", together with the number of times each word appears in the text. If I then read another text file (text_file_2), the script should do the same thing, but append the words it finds to the existing "List of Words".
For example:
List of Words
File 1:
God: 5
Nation: 4
creater: 8
USA: 3
File 2:
God: 10
Nation: 14
creater: 2
USA: 1
Here is the code I have so far:
from sys import argv
from string import punctuation

script = argv[0]
all_filenames = argv[1:]

print "Text file to import and read: " + all_filenames
print "\nReading file...\n"

text_file = open(all_filenames, 'r')
all_lines = text_file.readlines()
#print all_lines
text_file.close()

for all_filenames in argv[1:]:
    print "I get: " + all_filenames

print "\nFile read finished!"

#print "\nYour file contains the following text information:"
#print "\n" + text_file.read()

#~ for word, count in word_freq.items():
#~     print word, count

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
            'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
            'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
            'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
            'constitution', 'Government', 'Citizens', 'citizens']

for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )

output_file = open("List_of_words.txt", "w")
for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
Maybe something like this code could be used?
import fileinput

for line in fileinput.input('List_of_words.txt', inplace = True):
    if line.startswith('Existing file that was read'):
        #if line starts Existing file that was read then do something here
        print "Existing file that was read"
    elif line.startswith('New file that was read'):
        #if line starts with New file that was read then do something here
        print "New file that was read"
    else:
        print line.strip()
1 Answer
This will show you the results on the screen.
from sys import argv
from collections import Counter
from string import punctuation

script, filename = argv

text_file = open(filename, 'r')

word_freq = Counter([word.strip(punctuation) for line in text_file for word in line.split()])

#~ for word, count in word_freq.items():
#~     print word, count

key_words = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater',
             'Country', 'country', 'People', 'people', 'Liberty', 'liberty',
             'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage']

for word in key_words:
    if word in word_freq:
        print word, word_freq[word]
Now you need to save it to a file.
If you have more than one file, you can use
for filename in argv[1:]:
    # do your job
EDIT:
Using this code (my_script.py)
for filename in argv[1:]:
    print( "I get", filename )
you can run the script as
python my_script.py file1.txt file2.txt file3.txt
and you get
I get file1.txt
I get file2.txt
I get file3.txt
You can use this to count the words in several files.
-
Using readlines() reads all lines into memory at once, which uses more memory; for very large files this can become a problem. In the current version, Counter() counts all words from all lines while iterating the file object directly (you can try it), so it uses less memory. With readlines() you get the same word_freq, it just costs more memory.
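The memory-friendly variant can also be written with an explicit loop that feeds Counter one line at a time, never calling readlines(). A minimal sketch (the sample file is created here just so the example is self-contained):

```python
from collections import Counter
from string import punctuation

# hypothetical sample file, written here so the sketch runs on its own
with open("sample.txt", "w") as f:
    f.write("One Nation, under God.\nGod bless the Nation!\n")

word_freq = Counter()
with open("sample.txt") as text_file:
    for line in text_file:  # one line at a time, without readlines()
        word_freq.update(word.strip(punctuation) for word in line.split())
```

After this, word_freq holds the same counts as the readlines() version, e.g. 2 for "Nation" and 2 for "God" on the sample above.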
-
writelines(list_of_result) will not add a "\n" after each line, and it will not add the ':' in "God: 3" either.
It is better to use something like
output_file = open("List_of_words.txt", "w")
for word in key_words:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
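Alternatively, writelines() works fine if each line is built with its own ": " and "\n" first. A minimal sketch, using made-up counts in place of a real word_freq:

```python
from collections import Counter

# hypothetical counts standing in for a real word_freq
word_freq = Counter({"God": 5, "Nation": 4})
key_words = ["God", "Nation", "USA"]

# build each line with its own ': ' and '\n', because writelines() adds neither
result_lines = ["%s: %d\n" % (word, word_freq[word]) for word in key_words if word in word_freq]

with open("List_of_words.txt", "w") as output_file:
    output_file.writelines(result_lines)
```

This produces the same file as the write() loop above; 'USA' is skipped because it is not in word_freq.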
EDIT: this new version appends the results to the end of List_of_words.txt
from sys import argv
from string import punctuation
from collections import Counter

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
            'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
            'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
            'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
            'constitution', 'Government', 'Citizens', 'citizens']

for one_filename in argv[1:]:
    print "Text file to import and read:", one_filename
    print "\nReading file...\n"

    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()

    print "\nFile read finished!"

    word_freq = Counter([word.strip(punctuation) for line in all_lines for word in line.split()])

    print "Append result to the end of file: List_of_words.txt"

    output_file = open("List_of_words.txt", "a")
    for word in keyWords:
        if word in word_freq:
            output_file.write( "%s: %d\n" % (word, word_freq[word]) )
    output_file.close()
EDIT: write the sum of the results to one file
from sys import argv
from string import punctuation
from collections import Counter

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
            'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
            'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
            'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
            'constitution', 'Government', 'Citizens', 'citizens']

word_freq = Counter()

for one_filename in argv[1:]:
    print "Text file to import and read:", one_filename
    print "\nReading file...\n"

    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()

    print "\nFile read finished!"

    word_freq.update( [word.strip(punctuation) for line in all_lines for word in line.split()] )

print "Write sum of results: List_of_words.txt"

output_file = open("List_of_words.txt", "w")
for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
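One caveat that applies to every version above: line.split() only produces single tokens, so the two-word entry 'United States' in keyWords can never appear in word_freq. A phrase like that has to be counted on the raw text instead, e.g. with str.count(). A small sketch with a made-up sample line:

```python
# split() yields single tokens, so 'United States' can never equal one token
line = "In the United States, the United States Constitution governs."
assert "United States" not in line.split()

# count the phrase on the raw text instead
phrase_count = line.count("United States")
```

Note that str.count() matches raw substrings, so it would also count e.g. "United Statesman"; that is usually acceptable for a first script, but worth knowing.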