Counting the total number of words per file in a corpus with NLTK's conditional frequency distribution in Python (beginner)

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

                        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
2022.12.06_Bild 2.txt   3  36 109  40  47  43  29  29  33  23  24  12   8   6   4   2   2   0   0   0   0
2022.12.06_Bild 3.txt   2  42 129  59  57  46  46  35  22  24  17  21  13   5   6   6   2   2   2   0   0
2022.12.06_Bild 4.txt   3  36 106  48  43  32  38  30  19  39  15  14  16   6   5   8   3   2   3   1   0
2022.12.06_Bild 5.txt   1  55 162  83  68  72  46  24  34  38  27  16  12   8   8   5   9   3   1   5   1
2022.12.06_Bild 6.txt   7  69 216  76 113  83  73  52  49  42  37  20  19   9   7   5   3   6   3   0   1
2022.12.06_Bild 8.txt   0   2   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0
dtype: float64

                           1
2022.12.06_Bild 2.txt  451.0
2022.12.06_Bild 3.txt  538.0
2022.12.06_Bild 4.txt  471.0
2022.12.06_Bild 5.txt  679.0
2022.12.06_Bild 6.txt  890.0
2022.12.06_Bild 8.txt    3.0

2 Answers

User

Answer 1 · edited 2024-05-29 00:11:03

Let's first try to replicate the table using the infamous BookCorpus, with this directory structure:

/books_in_sentences
   books_large_p1.txt
   books_large_p2.txt

Code:

from nltk.corpus import PlaintextCorpusReader
from nltk import ConditionalFreqDist
from nltk import word_tokenize

from collections import Counter

import pandas as pd

corpus = PlaintextCorpusReader('books_in_sentences/', '.*')

cfd_appr = ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in 
                     word_tokenize(corpus.raw(fileids=textname))])

Then the pandas munging part:

# Idiom to convert a FreqDist / ConditionalFreqDist into pd.DataFrame.
df = pd.DataFrame([dict(Counter(freqdist)) 
                   for freqdist in cfd_appr.values()], 
                 index=cfd_appr.keys())
# Fill in the not-applicable with zeros.
df = df.fillna(0).astype(int)

# If necessary, sort the columns (word lengths) in ascending order.
df = df.sort_index(axis=1)

# Sum all columns per row -> pd.Series
counts_per_row = df.sum(axis=1)
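
To see what this conversion does on a small scale, here is a minimal sketch that uses hand-made length counts instead of a real corpus. The file names and numbers in toy_counts are made up for illustration; only the pandas calls mirror the answer.

```python
import pandas as pd

# Hypothetical word-length counts per file (word length -> number of words).
toy_counts = {
    "a.txt": {1: 2, 3: 4},
    "b.txt": {2: 5, 3: 1, 7: 1},
}

# One row per file, one column per word length; lengths a file lacks become NaN.
df = pd.DataFrame(list(toy_counts.values()), index=list(toy_counts.keys()))
df = df.fillna(0).astype(int)

# Put the word-length columns in ascending order (1, 2, 3, 7).
df = df.sort_index(axis=1)

# Summing across each row gives the total number of words per file.
counts_per_row = df.sum(axis=1)
print(counts_per_row.to_dict())
```

The row sums do not depend on the column order, so the sorting step only matters for how the table is displayed.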

Finally, to access the indexed Series, for example:

print('books_large_p1.txt', counts_per_row['books_large_p1.txt'])

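Alternatively

I encourage using the solution above so that you can manipulate the numbers further with the DataFrame, but if all you really need is the total count per row, try the following.

If you need to avoid pandas and work with the values in the CFD directly, you have to use ConditionalFreqDist.values() and iterate through it carefully.

If we do this: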

>>> list(cfd_appr.values())
[FreqDist({3: 6, 6: 5, 1: 5, 9: 4, 4: 4, 2: 3, 8: 2, 10: 2, 7: 1, 14: 1}),
 FreqDist({4: 10, 3: 9, 1: 5, 7: 4, 2: 4, 5: 3, 6: 3, 11: 1, 9: 1})]

we see a list of FreqDist objects, each corresponding to one key (in this case a filename):

>>> list(cfd_appr.keys())
['books_large_p1.txt', 'books_large_p2.txt']

Since FreqDist is a subclass of collections.Counter, summing the values of each Counter object gives:

>>> [sum(fd.values()) for fd in cfd_appr.values()]
[33, 40]

This outputs the same values as df.sum(axis=1) above.

To put it all together:

>>> dict(zip(cfd_appr.keys(), [sum(fd.values()) for fd in cfd_appr.values()]))
{'books_large_p1.txt': 33, 'books_large_p2.txt': 40}
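
As a possible shortcut, each FreqDist also has an N() method that returns the total number of recorded outcomes, so the manual sum can be skipped. A small sketch with made-up (filename, word length) pairs rather than a real corpus:

```python
from nltk import ConditionalFreqDist

# Made-up (filename, word length) pairs standing in for a real corpus.
pairs = [
    ("a.txt", 3), ("a.txt", 3), ("a.txt", 5),
    ("b.txt", 2), ("b.txt", 4),
]
cfd = ConditionalFreqDist(pairs)

# FreqDist.N() is the total number of samples counted under that condition.
totals = {name: cfd[name].N() for name in cfd.conditions()}
print(totals)
```

This should agree with summing fd.values() for each FreqDist, since both count every recorded sample once.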

User

Answer 2 · edited 2024-05-29 00:11:03

OK, here is what was actually needed:

First, get the number of words of each length (as I did before):

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

Then import pandas as pd and append to_frame(1) to the dtype: float64 Series I got by summing across the columns:

pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame(1)

That's it. However, if anyone knows how to do the summing inside the definition of cfd_appr, that would be a more elegant solution.
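
One possible way to get the totals in a single expression, without building the per-length table first, is to emit the filename once per word and count the filenames. The sketch below uses an in-memory dict named texts as a stand-in for corpus.fileids() and corpus.raw(fileids=...); the dict and its contents are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for corpus.fileids() / corpus.raw(fileids=name).
texts = {
    "a.txt": "ein kurzer\r\nBeispieltext",
    "b.txt": "noch ein Text\nmit zwei Zeilen",
}

# Emit the filename once per whitespace-separated word, then count filenames.
# str.split() with no argument already treats \r and \n as separators,
# so the explicit replace() calls are not needed here.
words_per_file = Counter(
    name for name, raw in texts.items() for _ in raw.split()
)
print(dict(words_per_file))
```

With NLTK, the same generator could be passed to nltk.FreqDist instead, since FreqDist is a Counter subclass.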
