Using NLTK's FreqDist

Asked 2025-04-16 19:12

I'm trying to get the frequency distribution of a set of documents in Python. For some reason my code isn't working, and it produces this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module>
    fd = FreqDist(corpus_text)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__
    self.update(samples)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update
    self.inc(sample, count=count)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'

Can you help me?

Here is my code so far:

import os
import nltk
from nltk.probability import FreqDist


#The stop-words list
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read()
stopwords_list = stopwords_doc.split()
stopwords = nltk.Text(stopwords_list)

corpus = []

#Directory of documents
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments"
listing = os.listdir(directory)

#Append all documents in directory into a single 'document' (list)
for doc in listing:
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc
    input = open(doc_name).read() 
    input = input.split()
    corpus.append(input)

#Turn list into Text form for NLTK
corpus_text = nltk.Text(corpus)

#Remove stop-words
for w in corpus_text:
    if w in stopwords:
        corpus_text.remove(w)

fd = FreqDist(corpus_text)

Tags: natural language processing, document processing, nltk, frequency distribution

2 Answers

The error is saying that you are trying to use a list as a hash-table key. Can you convert it to a tuple?
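To illustrate where the list probably comes from: the loop in the question calls corpus.append(input), which stores each document's word list as a single element, so the corpus ends up as a list of lists. Here is a minimal sketch of the difference between append and extend. The two token lists and the is_hashable helper are made up for illustration and are not part of the question's code.

```python
# Hypothetical stand-ins for the contents of two files after .split()
doc_a = ["the", "cat", "sat"]
doc_b = ["the", "dog", "ran"]

# What the question's loop does: each document becomes ONE element,
# so the result is a list of lists.
nested = []
for tokens in (doc_a, doc_b):
    nested.append(tokens)

# extend() adds the words themselves, giving one flat list of strings.
flat = []
for tokens in (doc_a, doc_b):
    flat.extend(tokens)


def is_hashable(obj):
    """Return True if obj can be used as a dict key (illustrative helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print([is_hashable(sample) for sample in nested])  # the inner lists are not hashable
print([is_hashable(sample) for sample in flat])    # the strings are hashable
print(is_hashable(tuple(doc_a)))                   # a tuple of strings is hashable too
```

Note that converting each inner list to a tuple would get rid of the TypeError, but the samples being counted would then be whole documents rather than words, which is probably not what you want here.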

Answered 2025-04-16 by Python大师

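I have two thoughts that I hope will help with this problem.

First, the documentation for the nltk.text.Text class says (emphasis mine):

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

So I'm not sure Text() is the right tool for your data. It looks to me like a plain list would be enough.

Second, I'd encourage you to think about the calculation you are asking NLTK to perform. Removing stop words before computing the frequency distribution will skew your frequencies; I don't see why you would remove them before counting rather than simply ignoring them when you examine the distribution afterwards. (I think this second point would be better as a question or comment than as part of an answer, but I felt it was worth pointing out that the proportions will be off.) Depending on how you intend to use the frequency distribution, this may or may not be a problem.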
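As a rough sketch of what "just use a list" could look like: flatten the per-document word lists into one list of strings and hand that to FreqDist directly. The sample sentences are made up. The fallback import is only there so the snippet runs without NLTK installed; it relies on FreqDist being a subclass of collections.Counter in NLTK 3.

```python
try:
    from nltk.probability import FreqDist
except ImportError:
    # Stand-in only: in NLTK 3, FreqDist subclasses collections.Counter.
    from collections import Counter as FreqDist

# Hypothetical token lists standing in for the split contents of each file.
documents = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
]

# Flatten into one plain list of strings; no nltk.Text wrapper involved.
tokens = [word for doc in documents for word in doc]

# Every sample is now a string, which is hashable, so counting works.
fd = FreqDist(tokens)

print(fd["the"])
print(fd["sat"])
```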
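To make the point about proportions concrete, here is a small sketch with made-up tokens and a made-up stop-word set. It counts everything first and only skips stop words when reading the results, then compares that with removing stop words before counting. As above, the Counter fallback is just a stand-in in case NLTK is not installed.

```python
try:
    from nltk.probability import FreqDist
except ImportError:
    # Stand-in only: in NLTK 3, FreqDist subclasses collections.Counter.
    from collections import Counter as FreqDist

# Hypothetical data for illustration.
tokens = "the cat sat on the mat the end".split()
stop_words = {"the", "on"}

# Option A: count every token, then ignore stop words when inspecting.
fd_all = FreqDist(tokens)
total_all = sum(fd_all.values())
content_counts = {w: c for w, c in fd_all.items() if w not in stop_words}
share_of_all = fd_all["cat"] / float(total_all)

# Option B: remove stop words first, then count.
fd_filtered = FreqDist(w for w in tokens if w not in stop_words)
total_filtered = sum(fd_filtered.values())
share_of_filtered = fd_filtered["cat"] / float(total_filtered)

# The raw count of "cat" is the same either way, but its proportion is not,
# because the denominator shrinks when stop words are dropped first.
print(share_of_all, share_of_filtered)
```

Whether the difference matters depends on whether you care about raw counts or relative frequencies.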

Answered 2025-04-16 by Python大师
