使用Python统计文本文件中所有单词的出现次数
怎样从一个文本文件创建一个字典呢?
这个测试文件的内容是:
Von Neumann architecture describes a general framework, or structure, that a computer's hardware, programming, and data should follow. Although other structures for computing have been devised and implemented, the vast majority of computers in use today operate according to the von Neumann architecture.The von Neumann in von Neumann architecture refers to Hungarian-American mathematician John von Neumann (1903-1957). Von Neumann was initially interested in access to the fastest computers available (of which there were few) during World War II in order to perform complex computations for a variety of war-related problems. In 1944, Von Neumann became a consultant to the ENIAC (Electronic Numerical Integrator and Computer) project, which upon its completion in 1945 became the world's first general purpose, electronic computer. Even before ENIAC's completion, von Neumann and several members of the team constructing ENIAC proposed building a more advanced computer, which would eventually become known as EDVAC (Electronic Discrete Variable Automatic Computer). In 1945 von Neumann wrote a landmark paper entitled The First Draft of a Report on the EDVAC, which encapsulated his ideas concerning the fundamental structure that a computer should follow. That report, which Von Neumann originally intended to be seen by a limited group of associates, nevertheless became widely disseminated and had an immediate impact on computer development in the United States and abroad.Von Neumann followed up on his first report by producing two more papers coauthored with colleagues from the ENIAC team. What emerged from these three papers was an overall structure, or architecture, which is by-and-large followed to this day by the vast majority of electronic, digital computers. Von Neumann envisioned the structure of a computer system as being composed of the following components: (1) the central arithmetic unit, which today is called the arithmetic-logic unit (ALU). This unit performs the computer's computational and logical functions; (2) memory; more specifically, the computer's main, or fast, memory, such as random access memory (RAM); (3) a control unit that directs other components of the computer to perform certain actions, such as directing the fetching of data or instructions from memory to be processed by the ALU; and (4) man-machine interfaces; i.e., input and output devices, such as a keyboard for input and display monitor for output. Of course, computer technology has developed extensively since von Neumann's time. For instance, due to integrated circuitry and miniaturization the ALU and control unit have been integrated onto the same microprocessor chip, becoming an integrated part of the computer's central processing unit (CPU).The most noteworthy concept contained in von Neumann's first report was most likely that of the stored-program principle. This principle holds that data, as well as the instructions used to manipulate that data, should be stored together in the same memory area of the computer. This idea deviated from the structure of previous computers. For example, ENIAC's numeric data was stored in its vacuum tube memory, while the instructions that directed the processing of that data was provided by certain hardware settings. That is to say, before each new computation with ENIAC, an operator set various dials, connected and disconnected various electric plugs, and so forth. Those particular hardware settings represented ENIAC's programming. It seemed obvious to von Neumann (as it did to several other people working on the ENIAC project) that to have a flexible, truly general-purpose computer meant that the stored program principle should be implemented.One ramification of storing data and programming in the same general area of the computer's main memory is the need to distinguish between the two. The contents of the typical computer's main memory is seen by the computer as a series of zeroes and ones (i.e., binary digits, or bits). The computer needs direction in order to determine whether a particular block of information is data or instructions. Von Neumann's control unit is the mechanism used to make the data-versus-instruction determination. When the control unit initiates a call for an instruction to be fetched for processing, a unit called the program counter points to the instruction's location in memory (i.e., its address in memory). The instruction is then fetched for execution by the processor. The address in memory of any data that is required is provided by the instruction itself. During this fetching and execution of an instruction, the program counter is incremented so that the next instruction can be found and executed. This process is sequential, meaning that instructions are executed in an ordered, sequential fashion, one instruction at a time. This serial execution of instructions is a hallmark of the von Neumann computer architecture. It is in contrast to parallel processing architectures in which multiple instructions are executed in tandem. A true parallel processing computer is considered a non-von Neumann architecture machine.To summarize the main characteristics of the von Neumann architecture, it is noted that, first of all, such a computer is composed of distinct components, which are the ALU, control unit, input/output devices, and a single memory unit for storing both data and instructions (i.e., the stored-program principle). Secondly, instructions are carried out sequentially, one instruction at a time. As von Neumann himself recognized, the sequential execution of programming imposes a sort of speed limit on program execution since only one instruction at a time can be handled by the computer's processor. Computer pioneer John Backus called this the von Neumann bottleneck. This bottleneck can manifest itself when the computer's CPU processes at a rate faster than information can be delivered from main memory. There have been a plethora of techniques devised to make the most of the sequential nature that von Neumann architecture places on computers by reducing any information bottlenecks. The development of faster processors has meant that programs are executed more quickly. Processing speed has also been increased by modifying the memory side of the equation, as in the case of cache memory (which basically provides a way of transferring information from main memory into a smaller, faster memory device). Other techniques include wider data buses to carry information more quickly between memory and the CPU; reduction of wait states (i.e., reduction of the time the CPU is required to suspend processing while waiting for information from auxiliary storage); and many other speed-enhancing strategies. It must be pointed out, however, that despite these advances and enhancements one is still left with the fundamental von Neumann architecture, which is followed in the overwhelming majority of computers in use today.
我需要统计独特的单词
print(len(set(w.lower() for w in open('von_neumann.txt').read().split())))
然后创建一个字典,字典里的键是文件中找到的每一个单词,而值则是每个单词在文本中出现的次数。
我使用的是Python 3.3.2。
5 个回答
0
这里提到的是使用 Counter
这个工具,它来自于 collections
这个模块。你可以在这个链接找到更多信息:http://docs.python.org/2/library/collections.html。
from collections import Counter
text = """Von Neumann architecture describes a general framework, or structure, that a computer's hardware, programming, and data should follow. Although other structures for computing have been devised and implemented, the vast majority of computers in use today operate according to the von Neumann architecture.The von Neumann in von Neumann architecture refers to Hungarian-American mathematician John von Neumann (1903-1957). Von Neumann was initially interested in access to the fastest computers available (of which there were few) during World War II in order to perform complex computations for a variety of war-related problems. In 1944, Von Neumann became a consultant to the ENIAC (Electronic Numerical Integrator and Computer) project, which upon its completion in 1945 became the world's first general purpose, electronic computer. Even before ENIAC's completion, von Neumann and several members of the team constructing ENIAC proposed building a more advanced computer, which would eventually become known as EDVAC (Electronic Discrete Variable Automatic Computer). In 1945 von Neumann wrote a landmark paper entitled The First Draft of a Report on the EDVAC, which encapsulated his ideas concerning the fundamental structure that a computer should follow. That report, which Von Neumann originally intended to be seen by a limited group of associates, nevertheless became widely disseminated and had an immediate impact on computer development in the United States and abroad.Von Neumann followed up on his first report by producing two more papers coauthored with colleagues from the ENIAC team. What emerged from these three papers was an overall structure, or architecture, which is by-and-large followed to this day by the vast majority of electronic, digital computers. Von Neumann envisioned the structure of a computer system as being composed of the following components: (1) the central arithmetic unit, which today is called the arithmetic-logic unit (ALU). This unit performs the computer's computational and logical functions; (2) memory; more specifically, the computer's main, or fast, memory, such as random access memory (RAM); (3) a control unit that directs other components of the computer to perform certain actions, such as directing the fetching of data or instructions from memory to be processed by the ALU; and (4) man-machine interfaces; i.e., input and output devices, such as a keyboard for input and display monitor for output. Of course, computer technology has developed extensively since von Neumann's time. For instance, due to integrated circuitry and miniaturization the ALU and control unit have been integrated onto the same microprocessor chip, becoming an integrated part of the computer's central processing unit (CPU).The most noteworthy concept contained in von Neumann's first report was most likely that of the stored-program principle. This principle holds that data, as well as the instructions used to manipulate that data, should be stored together in the same memory area of the computer. This idea deviated from the structure of previous computers. For example, ENIAC's numeric data was stored in its vacuum tube memory, while the instructions that directed the processing of that data was provided by certain hardware settings. That is to say, before each new computation with ENIAC, an operator set various dials, connected and disconnected various electric plugs, and so forth. Those particular hardware settings represented ENIAC's programming. It seemed obvious to von Neumann (as it did to several other people working on the ENIAC project) that to have a flexible, truly general-purpose computer meant that the stored program principle should be implemented.One ramification of storing data and programming in the same general area of the computer's main memory is the need to distinguish between the two. The contents of the typical computer's main memory is seen by the computer as a series of zeroes and ones (i.e., binary digits, or bits). The computer needs direction in order to determine whether a particular block of information is data or instructions. Von Neumann's control unit is the mechanism used to make the data-versus-instruction determination. When the control unit initiates a call for an instruction to be fetched for processing, a unit called the program counter points to the instruction's location in memory (i.e., its address in memory). The instruction is then fetched for execution by the processor. The address in memory of any data that is required is provided by the instruction itself. During this fetching and execution of an instruction, the program counter is incremented so that the next instruction can be found and executed. This process is sequential, meaning that instructions are executed in an ordered, sequential fashion, one instruction at a time. This serial execution of instructions is a hallmark of the von Neumann computer architecture. It is in contrast to parallel processing architectures in which multiple instructions are executed in tandem. A true parallel processing computer is considered a non-von Neumann architecture machine.To summarize the main characteristics of the von Neumann architecture, it is noted that, first of all, such a computer is composed of distinct components, which are the ALU, control unit, input/output devices, and a single memory unit for storing both data and instructions (i.e., the stored-program principle). Secondly, instructions are carried out sequentially, one instruction at a time. As von Neumann himself recognized, the sequential execution of programming imposes a sort of speed limit on program execution since only one instruction at a time can be handled by the computer's processor. Computer pioneer John Backus called this the von Neumann bottleneck. This bottleneck can manifest itself when the computer's CPU processes at a rate faster than information can be delivered from main memory. There have been a plethora of techniques devised to make the most of the sequential nature that von Neumann architecture places on computers by reducing any information bottlenecks. The development of faster processors has meant that programs are executed more quickly. Processing speed has also been increased by modifying the memory side of the equation, as in the case of cache memory (which basically provides a way of transferring information from main memory into a smaller, faster memory device). Other techniques include wider data buses to carry information more quickly between memory and the CPU; reduction of wait states (i.e., reduction of the time the CPU is required to suspend processing while waiting for information from auxiliary storage); and many other speed-enhancing strategies. It must be pointed out, however, that despite these advances and enhancements one is still left with the fundamental von Neumann architecture, which is followed in the overwhelming majority of computers in use today."""
print>>open('test.txt','w'), text
dictionary = Counter((open('test.txt','r').read().split()))
print dictionary.most()[:10]
这个代码的输出结果是:
[('the', 66), ('of', 38), ('a', 29), ('to', 24), ('and', 23), ('in', 21), ('Neumann', 20), ('is', 20), ('that', 16), ('von', 15)]
这段代码的意思是打开一个名为 'test.txt' 的文件,并以只读的方式读取里面的内容。
1
我已经解决了这个问题
def main():
#user asked to enter a file name
filename = input("Enter the name of the input file:\t")
if filename == 'von_neumann.txt':
filename=open('von_neumann.txt','r')
#text file is opened for reading
else:
print('File not found')
#reads the files contents
filename_contents = filename.read()
# file closed
filename.close()
#opens dictionary.txt for writing
outfile = open('dictionary.txt','w')
#loops to count words
count = {}
for w in open('von_neumann.txt').read().split():
if w in count:
count[w] += 1
else:
count[w] = 1
for word, times in count.items():
txt=("%s was found %d times\n") % (word, times)
outfile.write (txt)
#prints the amount of unique words and then tells the user that the count
#was saved to dictionary.txt
print("There are "+str(len(count)) ,"unique words in this text")
outfile.close()
print("The dictionary was written to dictionary.txt")
main()
不过还是谢谢你的帮助 :)
1
我来补充一下其他答案的内容,不过如果你刚接触编程或者正则表达式,可能会有点难懂。最重要的是要知道,其他答案没有考虑到字母的大小写和标点符号。
举个例子:“structure,”、“structure”和“Structure”会被当作三个不同的词,每个词的值都是1,而不是一个词值为3。如果你想要这样的结果,那就太好了;但如果不是,请看下面的解决方案,它会把标点符号和大小写都去掉。
import collections
import re
reg = re.compile('[^a-zA-Z0-9 ]+')
counter = collections.Counter()
with open('countme.txt') as f:
for line in f:
clean_line = reg.sub('', line.lower().strip())
counter.update(clean_line.split())
1
我会简单地这样做:
file = open('test.txt', 'r').read().split
words = {}
for w in file:
if w not in words:
words[w] = 1
else:
words[w] += 1
不需要任何import
语句。
3
你可以使用 collections.Counter() 这个工具:
from collections import Counter
with open('test.txt', 'rb') as f:
counter = Counter()
for line in f:
counter.update(line.split())
print counter
输出结果是:
Counter({'the': 66, 'of': 38, 'a': 29, 'to': 24, 'and': 23, ... })