Counting bigram frequencies
I wrote some code that counts word frequencies and writes them into an ARFF file for use with Weka. I'd like to modify it so that it counts bigram frequencies, i.e. how often pairs of words occur together, rather than single words, but my attempts so far have failed.
I know this may be a lot to look at, but any help would be much appreciated. Here is my code:
import re
import nltk

# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

# create list of lower case words
word_list = re.split(r'\s+', open(filename).read().lower())
print 'Words in text:', len(word_list)

# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = [punctuation.sub("", word) for word in word_list]
word_list2 = [w.strip() for w in word_list
              if w.strip() not in nltk.corpus.stopwords.words('english')]

# create dictionary of word:frequency pairs
freq_dic = {}
for word in word_list2:
    try:
        freq_dic[word] += 1
    except KeyError:
        freq_dic[word] = 1

print '-' * 30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs and sort by frequency
freq_list2 = [(val, key) for key, val in freq_dic.items()]
freq_list2.sort(reverse=True)

# keep the top 10 most frequent words
words = [str(item[1]).lower() for item in freq_list2[:10]]

# re-read the file, one cleaned, lower-cased line per entry
f = open(filename)
newlist = []
for line in f:
    newlist.append(punctuation.sub("", line).lower())

# tokenise each line and log the result to Lines.txt
f2 = open('Lines.txt', 'w')
newlist2 = []
for line in newlist:
    line = line.split()
    newlist2.append(line)
    f2.write(str(line))
    f2.write("\n")
print newlist2

# ARFF creation
arff = open('output.arff', 'w')
arff.write('@RELATION wordfrequency\n\n')
for word in words:
    arff.write('@ATTRIBUTE %s numeric\n' % word)
arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
arff.write('@DATA\n')

# count occurrences of each top word in each verse
for line in newlist2:
    word_occurrences = ""
    for word in words:
        matches = 0
        for item in line:
            if item == word:
                matches += 1
        word_occurrences += str(matches) + ","
    word_occurrences += "endofworld"
    arff.write(word_occurrences)
    arff.write("\n")
print words
4 Answers
Life will be a lot simpler if you start using NLTK's FreqDist for the counting. NLTK also has bigram support; examples of both can be found in the NLTK documentation.
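For illustration, here is a minimal sketch of that approach (assuming NLTK 3, where FreqDist exposes the Counter interface; the sample word list is made up):

import nltk

words = ['the', 'cat', 'sat', 'on', 'the', 'dog', 'on', 'the', 'cat']

# nltk.bigrams pairs up consecutive words; FreqDist counts any hashable
# items, so it can count the bigram tuples directly
fd = nltk.FreqDist(nltk.bigrams(words))
for pair, count in fd.most_common(3):
    print pair, count   # e.g. ('the', 'cat') 2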
This extends the idea to n-grams (runs of n consecutive words) with optional padding, and uses defaultdict(int) for the frequency counts, so it works on Python 2.6.
from collections import defaultdict

def ngrams(words, n=2, padding=False):
    "Compute n-grams with optional padding"
    pad = [] if not padding else [None] * (n - 1)
    grams = pad + words + pad
    return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))

# grab n-grams
words = ['the', 'cat', 'sat', 'on', 'the', 'dog', 'on', 'the', 'cat']
for size, padding in ((3, 0), (4, 0), (2, 1)):
    print '\n%d-grams padding=%d' % (size, padding)
    print list(ngrams(words, size, padding))

# show frequency
counts = defaultdict(int)
for ng in ngrams(words, 2, False):
    counts[ng] += 1

print '\nfrequencies of bigrams:'
for c, ng in sorted(((c, ng) for ng, c in counts.iteritems()), reverse=True):
    print c, ng
Output:
3-grams padding=0
[('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'),
('on', 'the', 'dog'), ('the', 'dog', 'on'), ('dog', 'on', 'the'),
('on', 'the', 'cat')]
4-grams padding=0
[('the', 'cat', 'sat', 'on'), ('cat', 'sat', 'on', 'the'),
('sat', 'on', 'the', 'dog'), ('on', 'the', 'dog', 'on'),
('the', 'dog', 'on', 'the'), ('dog', 'on', 'the', 'cat')]
2-grams padding=1
[(None, 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'),
('on', 'the'), ('the', 'dog'), ('dog', 'on'), ('on', 'the'),
('the', 'cat'), ('cat', None)]
frequencies of bigrams:
2 ('the', 'cat')
2 ('on', 'the')
1 ('the', 'dog')
1 ('sat', 'on')
1 ('dog', 'on')
1 ('cat', 'sat')
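To tie this back to the question's ARFF step, here is a hedged sketch (not part of the original answer) that reuses the ngrams helper above. It assumes counts has been built from your own cleaned word list (the question's word_list2) rather than the toy example, and that newlist2 holds one tokenised verse per entry as in the question. Bigram attribute names are joined with '_' so they need no quoting in ARFF:

from collections import defaultdict

# ten most frequent bigrams over the whole text
top_bigrams = [ng for c, ng in
               sorted(((c, ng) for ng, c in counts.iteritems()), reverse=True)[:10]]

arff = open('output.arff', 'w')
arff.write('@RELATION bigramfrequency\n\n')
for w1, w2 in top_bigrams:
    arff.write('@ATTRIBUTE %s_%s numeric\n' % (w1, w2))
arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n@DATA\n')

for line in newlist2:
    line_counts = defaultdict(int)
    for ng in ngrams(line, 2):          # bigrams within one verse
        line_counts[ng] += 1
    arff.write(','.join(str(line_counts[ng]) for ng in top_bigrams))
    arff.write(',endofworld\n')
arff.close()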
This code should get you started:
def bigrams(words):
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w
Note that the first bigram is (None, w1), where w1 is the first word, so it is a special bigram that marks the start of the text. If you also want a bigram marking the end of the text, add yield (wprev, None) after the loop.
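Hooked into the question's pipeline, usage might look like this (a sketch; word_list2 is the stop-word-filtered list from the question):

freq_dic = {}
for pair in bigrams(word_list2):
    freq_dic[pair] = freq_dic.get(pair, 0) + 1

# drop the (None, first_word) start marker if you don't want it counted
freq_dic.pop((None, word_list2[0]), None)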