Python 2.7: Building a tf-idf script with dictionaries

-1 votes

3 answers

1404 views

Asked 2025-04-18 18:42

I want to write a script that uses dictionaries to compute tf-idf (the ratio).

The idea is to use os.walk to find every .txt file in a directory and its subdirectories:

import fnmatch
import os

# directory is assumed to hold the path of the folder to search
files = []
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*.txt'):
        files.append(os.path.join(root, filename))

Then it uses this list of files to find all the words and how many times each one occurs:

import re
from collections import Counter

def word_sort(filename3):
    with open(filename3) as f3:
        passage = f3.read()
    # a set makes the membership test in the list comprehension fast
    stop_words = set("THE OF A TO AND IS IN YOU THAT IT THIS YOUR AS AN BUT FOR".split())
    words = re.findall(r'\w+', passage)
    cap_words = [word.upper() for word in words if word.upper() not in stop_words]
    # Counter maps each remaining word to the number of times it occurs
    return Counter(cap_words)

term_freq_per_file = {}
for file in files:
    term_freq_per_file[file] = (word_sort(file))

In the end it produces a dictionary like this:

 '/home/seb/Learning/ex15_sample.txt': Counter({'LOTS': 2, 'STUFF': 2, 'HAVE': 1,
                                     'I': 1, 'TYPED': 1, 'INTO': 1, 'HERE': 1,
                                      'FILE': 1, 'FUN': 1, 'COOL': 1,'REALLY': 1}),

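As I see it, this tells me how often each word occurs in each file.

How do I find the actual tf?

And how do I find the idf?

Here tf means term frequency: how often a word (term) occurs in a document.

TF(t) = (number of times term t appears in the document) / (total number of words in the document).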
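
To make the TF formula concrete, here is a minimal sketch that turns one file's Counter into term frequencies. The helper name term_frequencies is my own, not something from the script; the sample counts are copied from the ex15_sample.txt entry shown above. Because stop words were already filtered out, "total number of words" here means the total of the remaining words, which is an assumption about what is wanted.

```python
from collections import Counter


def term_frequencies(counts):
    """Map each word to count / total, given one file's word counts."""
    total = float(sum(counts.values()))  # float() avoids Python 2 integer division
    if total == 0:
        return {}  # empty file: no terms, so nothing to divide
    return dict((word, n / total) for word, n in counts.items())


# Counts copied from the ex15_sample.txt entry in the question.
sample = Counter({'LOTS': 2, 'STUFF': 2, 'HAVE': 1, 'I': 1, 'TYPED': 1,
                  'INTO': 1, 'HERE': 1, 'FILE': 1, 'FUN': 1, 'COOL': 1,
                  'REALLY': 1})
tf = term_frequencies(sample)
print(tf['LOTS'])  # 2 of the 13 counted words
```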

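idf means inverse document frequency, where the document frequency is the number of documents the word appears in.

IDF(t) = log_e(total number of documents / number of documents containing term t).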
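
As a rough illustration of the IDF formula, the sketch below counts how many per-file dictionaries contain a term and takes the natural log of the ratio. The function name inverse_document_frequency and the two-file corpus are hypothetical; the only assumption about the real data is that it maps each file path to a dictionary (or Counter) of word counts, like term_freq_per_file.

```python
import math


def inverse_document_frequency(term, term_freq_per_file):
    """Return ln(number of documents / number of documents containing term)."""
    n_docs = len(term_freq_per_file)
    # A word is "in" a document if it is a key of that document's counts.
    df = sum(1 for counts in term_freq_per_file.values() if term in counts)
    if df == 0:
        raise ValueError('term %r appears in no document' % (term,))
    return math.log(float(n_docs) / df)


# Hypothetical two-file corpus, for illustration only.
corpus = {
    'a.txt': {'HERE': 1, 'LOTS': 2},
    'b.txt': {'HERE': 1, 'FUN': 1},
}
print(inverse_document_frequency('HERE', corpus))  # in every file, so ln(1) = 0.0
print(inverse_document_frequency('LOTS', corpus))  # in 1 of 2 files, so ln(2)
```

A word that occurs in every file gets an idf of 0, which is the point of the measure: it carries no information for telling the files apart.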

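To be clearer: my question is how to extract these values and put them into the formulas. I know they are in there, but I don't know how to get them out and use them further.

I decided to create another dictionary that records which files each word is used in, like this: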

{word : (file1, file2, file3)}

by iterating over the first dictionary like this:

for file in tfDic:
     word = tfDic[file][Counter]
     for word in tfDic:
        if word not in dfDic.keys():
            dfDic.setdefault(word,[]).append(file)
        if word in dfDic.keys():
            dfDic[word].append(file)

The problem is in this line:

word = tfDic[file][Counter]

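I thought it would "navigate" to the word, but I noticed that the words are keys of the Counter dictionary, which is itself a value in tfDic (keyed by file).

My question is: how do I tell it to iterate over the words (the keys of the 'Counter' dictionary)?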
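
For reference, a Counter is a dict subclass, so iterating over tfDic[file] yields its words directly and no [Counter] index is needed. A minimal sketch of building the word-to-files mapping follows; using defaultdict(set) is my own choice rather than anything from the script, and the two-file tfDic is made-up sample data.

```python
from collections import Counter, defaultdict

# Made-up stand-in for the tfDic built earlier: file path -> Counter of words.
tfDic = {
    'file1.txt': Counter({'LOTS': 2, 'HERE': 1}),
    'file2.txt': Counter({'HERE': 1, 'FUN': 1}),
}

dfDic = defaultdict(set)  # word -> set of files that contain it
for path, counts in tfDic.items():
    # Iterating a Counter (like any dict) yields its keys, i.e. the words.
    for word in counts:
        dfDic[word].add(path)

print(sorted(dfDic['HERE']))  # ['file1.txt', 'file2.txt']
print(len(dfDic['LOTS']))     # 1
```

A set is used so that a file can never be recorded twice for the same word, and len(dfDic[word]) is then the document frequency.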


3 Answers

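Unless you are learning how tf-idf works, I suggest you simply use the ready-made scikit-learn classes for this task.

First, build a list containing one count dictionary per file. Then pass that list of count dictionaries to DictVectorizer, and pass the resulting sparse matrix to TfidfTransformer.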

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

dv = DictVectorizer()
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = dv.fit_transform(D)
tv = TfidfTransformer()
tfidf = tv.fit_transform(X)
print(tfidf.toarray())  # toarray() converts the sparse matrix to a dense array
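
Be aware that the numbers from TfidfTransformer will not match the textbook formulas in the question. With its default settings (smooth_idf=True, norm='l2'), scikit-learn uses the raw counts as tf, computes idf as ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The pure-Python sketch below reproduces that arithmetic for the same two toy documents so the difference is visible. The function name sklearn_style_tfidf is mine, and the code is an illustration of the formula, not scikit-learn's implementation.

```python
import math


def sklearn_style_tfidf(docs):
    """docs: list of {term: count}. Returns a list of {term: weight}."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = dict((t, math.log((1.0 + n) / (1.0 + d)) + 1.0) for t, d in df.items())
    rows = []
    for doc in docs:
        raw = dict((t, c * idf[t]) for t, c in doc.items())
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # L2-normalise; an empty document is left as an empty row.
        rows.append(dict((t, w / norm) for t, w in raw.items()) if norm else {})
    return rows


D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
for row in sklearn_style_tfidf(D):
    print(sorted(row.items()))
```

With these defaults a term that appears in every document still gets a non-zero weight, whereas the plain log_e formula would give it 0.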

Answered 2025-04-18 by Python大师


(Finally)

I decided to go back and change how I count the words, so instead of:

word_sort = Counter(cap_words)

I now loop over the list of words and build a dictionary myself that records how many times each word occurs:

word_sort = {}
for term in cap_words:
    word_sort[term] = cap_words.count(term)

So rather than having a Counter as the sub-dictionary each time, I end up with this tfDic:

'/home/seb/Learning/ex17output.txt': {'COOL': 1,
                                   'FILE': 1,
                                   'FUN': 1,
                                   'HAVE': 1,
                                   'HERE': 1,
                                   'I': 1,
                                   'INTO': 1,
                                   'LOTS': 2,
                                   'REALLY': 1,
                                   'STUFF': 2,
                                   'TYPED': 1},

Then I iterate over the keys of tfDic[file] to build another dictionary that records which files each word has been used in:

for file in tfDic:
    word = tfDic[file].keys()
    for word in tfDic[file]:
        if word not in dfDic.keys():
            dfDic.setdefault(word,[]).append(file)
        if word in dfDic.keys():
            dfDic[word].append(file)

The final result looks like this:

 'HERE': ['/home/seb/Learning/ex15_sample.txt',
      '/home/seb/Learning/ex15_sample.txt',
      '/home/seb/Learning/ex17output.txt'],
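
Notice that ex15_sample.txt is listed twice for 'HERE'. That follows from the loop above: for a word not yet seen, the first if appends the file, and by then the second if is also true, so the same file is appended again. One possible fix, sketched here with made-up sample data, is a single setdefault call:

```python
# Made-up stand-in for tfDic: file path -> {word: count}.
tfDic = {
    'ex15_sample.txt': {'HERE': 1, 'LOTS': 2},
    'ex17output.txt': {'HERE': 1, 'FUN': 1},
}

dfDic = {}
for path in sorted(tfDic):  # sorted only to make the output order predictable
    for word in tfDic[path]:
        # setdefault creates the list on first sight of the word, then we
        # append exactly once, so no file is recorded twice.
        dfDic.setdefault(word, []).append(path)

print(dfDic['HERE'])  # ['ex15_sample.txt', 'ex17output.txt']
```

With the duplicates gone, len(dfDic[word]) equals the number of files containing the word.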

Now I plan to simply "extract" these values and plug them into the formulas I mentioned earlier.
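
As a rough sketch of that last step, the function below combines the two dictionaries into tf-idf scores. It assumes tfDic maps each file to {word: count} and dfDic maps each word to the list of files containing it, as above; wrapping that list in set() guards against the repeated entries visible in the output. The function name tf_idf and the sample data are hypothetical.

```python
import math


def tf_idf(tfDic, dfDic):
    """Return {file: {word: tf * idf}} using the formulas from the question."""
    n_docs = len(tfDic)
    scores = {}
    for path, counts in tfDic.items():
        total = float(sum(counts.values()))  # words counted in this file
        scores[path] = {}
        for word, count in counts.items():
            tf = count / total
            doc_freq = len(set(dfDic[word]))  # set() ignores repeated files
            idf = math.log(float(n_docs) / doc_freq)
            scores[path][word] = tf * idf
    return scores


# Made-up sample data shaped like the dictionaries above.
tfDic = {'a.txt': {'HERE': 1, 'LOTS': 2}, 'b.txt': {'HERE': 1, 'FUN': 1}}
dfDic = {'HERE': ['a.txt', 'a.txt', 'b.txt'], 'LOTS': ['a.txt'], 'FUN': ['b.txt']}
scores = tf_idf(tfDic, dfDic)
print(scores['a.txt']['HERE'])  # 0.0, since HERE is in every file
print(scores['a.txt']['LOTS'])  # (2/3) * ln(2)
```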

Answered 2025-04-18 by Python大师


If you want to keep your current data structure, then for each word in each file you have to walk the whole structure to compute its idf.

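import math

# assume the term you are looking for is in the variable term
df = 0
for file in files:
    if term in term_freq_per_file[file]:
        df += 1
# float() prevents Python 2.7 integer division; df stays 0 if no file contains term
idf = math.log(float(len(files)) / df)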

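An earlier version of this answer included a simple design for an alternative data structure, but the approach above is probably good enough for now.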

Answered 2025-04-18 by Python大师
