Counting (and writing out) word frequencies for each line of a text file

hotwords = (['tweet'], ['twitter'])
for each line
    tokenize into words.
    for each word in line
        if word is equal to hotword[1], hotword1 count ++
        if word is equal to hotword[2], hotword2 count ++
    at end of line, for each hotword[index]
        filewrite count,

import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tokenize import LineTokenizer
#from nltk.tokenize import WordPunctTokenizer
from collections import defaultdict

# Create reader and generate corpus from all txt files in dir.
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus'
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer())
print("Reader accessible.")
print(filereader.fileids())

#define hotwords
hotwords = ('cool','foo','bar')

tweetdict = []

for line in filereader.sents():
    wordcounts = defaultdict(int)
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)

3 answers

User

#1 · edited 2024-04-25 22:30:04

Do you need to tokenize it? You can use count() on each line for each of your words.

hotwords = {'tweet':[], 'twitter':[]}
for line in file_obj:
    for word in hotwords.keys():
        hotwords[word].append(line.count(word))
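
One caveat with this approach: str.count() counts substrings, so 'tweet' is also found inside 'tweets' or 'retweet'. Here is a minimal sketch of the difference, assuming whitespace tokenization is good enough. The sample line and the helper names count_substrings and count_tokens are made up for illustration.

```python
def count_substrings(line, hotwords):
    # Hypothetical helper: substring counts, as str.count() on the raw line gives.
    return {word: line.count(word) for word in hotwords}


def count_tokens(line, hotwords):
    # Hypothetical helper: split on whitespace first, then count whole tokens only.
    tokens = line.split()
    return {word: tokens.count(word) for word in hotwords}


# Made-up sample line for illustration.
sample = "retweet this tweet about twitter tweets"
hotwords = ('tweet', 'twitter')

print(count_substrings(sample, hotwords))
print(count_tokens(sample, hotwords))
```

If only whole words should count, the tokenized variant is probably the one you want.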

User

#2 · edited 2024-04-25 22:30:04

defaultdict is your friend for this kind of thing.

from collections import defaultdict

hotwords = ('tweet', 'twitter')
for line in myfile:
    # tokenize on whitespace; iterating over the string itself would yield characters
    word_counts = defaultdict(int)
    for word in line.split():
        if word in hotwords:
            word_counts[word] += 1
    print('\n'.join('%s: %s' % (k, v) for k, v in word_counts.items()))
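
The question also asks for the counts to be written out at the end of each line. This is a rough sketch of one way to do that with the same defaultdict loop, assuming whitespace tokenization and CSV output. The function name write_hotword_counts and the sample lines are hypothetical, and an in-memory buffer stands in for a real output file.

```python
import csv
import io
from collections import defaultdict


def write_hotword_counts(lines, out, hotwords):
    # Hypothetical helper: writes one CSV row of counts per input line.
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(hotwords)  # header row with the hotwords themselves
    for line in lines:
        word_counts = defaultdict(int)
        for word in line.split():
            if word in hotwords:
                word_counts[word] += 1
        # Hotwords that never appeared on this line default to 0.
        writer.writerow([word_counts[word] for word in hotwords])


# Made-up input; an open file object would work the same way as the list.
out = io.StringIO()
write_hotword_counts(["a tweet on twitter", "tweet tweet", ""], out, ('tweet', 'twitter'))
print(out.getvalue())
```

With a real file, passing an object opened in text mode for writing in place of the buffer should behave the same.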

User

#3 · edited 2024-04-25 22:30:04

from collections import Counter

hotwords = ('tweet', 'twitter')

lines = "a b c tweet d e f\ng h i j k   twitter\n\na"

c = Counter(lines.split())

for hotword in hotwords:
    print hotword, c[hotword]

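This script requires Python 2.7 or later, because collections.Counter was added in 2.7. On Python 3, the print statement has to be written as a function call: print(hotword, c[hotword]).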
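
Note that the snippet above counts over the whole text at once, while the question asks for counts per line. Below is a sketch of the same Counter idea applied line by line, reusing the sample string from this answer. The function name per_line_counts is hypothetical, and the code assumes Python 3.

```python
from collections import Counter

hotwords = ('tweet', 'twitter')

lines = "a b c tweet d e f\ng h i j k   twitter\n\na"


def per_line_counts(text, hotwords):
    # Hypothetical helper: one dict of hotword counts per line of text.
    result = []
    for line in text.splitlines():
        c = Counter(line.split())
        # A Counter returns 0 for missing keys, so absent hotwords are reported as 0.
        result.append({word: c[word] for word in hotwords})
    return result


for number, counts in enumerate(per_line_counts(lines, hotwords), start=1):
    print(number, counts)
```

Empty lines produce a row of zeros rather than being skipped, which keeps the output aligned with the input line numbers.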
