Code to count the number of sentences, words and characters in an input file

1 vote
4 answers
13166 views
Asked 2025-04-16 12:24

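I wrote the following code to count the number of sentences, words and characters in the input file sample.txt, which contains a paragraph of text. It works fine for the sentence and word counts, but the character count (excluding whitespace and punctuation marks) is not accurate.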

lines,blanklines,sentences,words=0,0,0,0
num_chars=0

print '-'*50

try:
    filename = 'sample.txt'
    textf = open(filename, 'r')
except IOError:
    print 'Could not open file %s for reading' % filename
    import sys
    sys.exit(0)

for line in textf:
    print line
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    else:
        # count every '.', '!' and '?' as the end of a sentence
        sentences += line.count('.') + line.count('!') + line.count('?')

        # split on any run of whitespace to get the words
        tempwords = line.split(None)
        print tempwords
        words += len(tempwords)

textf.close()

print '-'*50
print "Lines:", lines
print "Blank lines:", blanklines
print "Sentences:", sentences
print "Words:", words

import nltk
import nltk.data
import nltk.tokenize

with open('sample.txt', 'r') as f:
    for line in f:
        num_chars += len(line)

num_chars = num_chars - (words + 1)

pcount = 0
from nltk.tokenize import TreebankWordTokenizer
with open('sample.txt','r') as f1:
    for line in f1:
        #tokenised_words = nltk.tokenize.word_tokenize(line)
        tokenizer = TreebankWordTokenizer()
        tokenised_words = tokenizer.tokenize(line)
        for w in tokenised_words:
            if ((w=='.')|(w==';')|(w=='!')|(w=='?')):
                pcount = pcount + 1
print "Number of punctuation marks:", pcount
num_chars = num_chars - pcount
print "Number of characters:", num_chars

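pcount is the number of punctuation marks. Can someone suggest what changes I need to make to get the exact number of characters, excluding whitespace and punctuation marks?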

4 Answers

0

One thing you can do is iterate over the line character by character as you read it, and count the characters:

for character in line:
    if character.isalnum():
        num_chars += 1

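Additionally, you may need to change the condition of the if statement to suit your particular needs, for example if you want to count $ as well.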
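As a rough sketch of that kind of adjustment (the helper name `count_chars` and its `extra` parameter are made up for illustration, and this version is written with Python 3 in mind), the condition can be widened to accept a few extra symbols:

```python
def count_chars(line, extra=""):
    """Count letters and digits in `line`, plus any characters listed in `extra`.

    `count_chars` and `extra` are hypothetical names used only for this sketch.
    """
    allowed = set(extra)
    # isalnum() is False for whitespace and punctuation, so those are skipped
    # unless they were explicitly listed in `extra`.
    return sum(1 for ch in line if ch.isalnum() or ch in allowed)


sample = "It costs $5, right?\n"
print(count_chars(sample))       # letters and digits only
print(count_chars(sample, "$"))  # also counts the dollar sign
```

Note that `isalnum()` also accepts non-ASCII letters and digits on Python 3 strings, which may or may not be what you want.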

1

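You could also use a regular expression to strip out every character that is not a letter or a digit, and then count the characters left on each line.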
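A minimal sketch of that idea, assuming "character" means ASCII letters and digits only (the names `NON_ALNUM`, `count_alnum` and `count_alnum_in_file` are just illustrative, and the code is written with Python 3 in mind):

```python
import re

# Matches anything that is NOT an ASCII letter or digit.
# Assumption: only A-Z, a-z and 0-9 should count as characters.
NON_ALNUM = re.compile(r"[^A-Za-z0-9]")


def count_alnum(line):
    """Return the number of ASCII letters and digits in `line`."""
    # Remove everything else, then measure what is left.
    return len(NON_ALNUM.sub("", line))


def count_alnum_in_file(path):
    """Sum count_alnum over every line of the file at `path`."""
    with open(path) as f:
        return sum(count_alnum(line) for line in f)


print(count_alnum("Hello, world! 42\n"))
```

For the file from the question you would call `count_alnum_in_file('sample.txt')`. Because spaces, newlines and every punctuation mark are removed before counting, there is nothing left to subtract afterwards.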

2
import string

#
# Per-line counting functions
#
def countLines(ln):      return 1
def countBlankLines(ln): return 0 if ln.strip() else 1
def countWords(ln):      return len(ln.split())

def charCounter(validChars):
    vc = set(validChars)
    def counter(ln):
        return sum(1 for ch in ln if ch in vc)
    return counter
countSentences = charCounter('.!?')
countLetters   = charCounter(string.ascii_letters)  # string.letters exists only in Python 2
countPunct     = charCounter(string.punctuation)

#
# do counting
#
class FileStats(object):
    def __init__(self, countFns, labels=None):
        super(FileStats,self).__init__()
        self.fns    = countFns
        self.labels = labels if labels else [fn.__name__ for fn in countFns]
        self.reset()

    def reset(self):
        self.counts = [0]*len(self.fns)

    def doFile(self, fname):
        try:
            with open(fname) as inf:
                for line in inf:
                    for i,fn in enumerate(self.fns):
                        self.counts[i] += fn(line)
        except IOError:
            print('Could not open file {0} for reading'.format(fname))

    def __str__(self):
        return '\n'.join('{0:20} {1:>6}'.format(label, count) for label,count in zip(self.labels, self.counts))

fs = FileStats(
    (countLines, countBlankLines, countSentences, countWords, countLetters, countPunct),
    ("Lines",    "Blank Lines",   "Sentences",    "Words",    "Letters",    "Punctuation")
)
fs.doFile('sample.txt')
print(fs)

results in

Lines                   101
Blank Lines              12
Sentences                48
Words                   339
Letters                1604
Punctuation             455
