Fixing TF-IDF output

Posted 2024-04-19 04:02:15


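I have implemented the TF-IDF algorithm in PySpark, following a MapReduce-style process. I want my output to look like this: [(word, tf_idf score), (word, tf_idf score), (word, tf_idf score)].

Each word should be unique: even if it appears many times in the text, it should appear only once in the final output. I tried to achieve this with the reduceByKey function, but it did not work.

Also, the output currently shows letters instead of words, for a reason I have not been able to debug.

Can you explain what I am missing in my code?

Thanks a lot.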

import string
import numpy as np
list_punct=list(string.punctuation)

text = '/dbfs/FileStore/tables/full_text.txt'

text_rdd = sc.parallelize(text)

filtered_data = text_rdd. \
    map(lambda x: x.strip()). \
    filter(lambda x: len(x) != 0). \
    map(lambda punct : ''.join([txt.lower() for txt in punct if txt not in list_punct]))

number_of_docs = filtered_data.count()

doc_with_id = filtered_data.zipWithIndex()

tokenized_text = doc_with_id.map(lambda x: (x[1], x[0].split()) )

term_count = tokenized_text.flatMapValues(lambda x: x).countByValue()

term_document_count = tokenized_text.flatMapValues(lambda x: x).distinct()\
                        .map(lambda x: (x[1], x[0])).countByKey()


def tf_idf(N, term_freq, term_document_count):
    result = []
    for key, value in term_freq.items():
        doc = key[0]
        term = key[1]
        df = term_document_count[term]
        if (df>0):
            tf_idf = float(value)*np.log(number_of_docs/df)

        result.append({"doc":doc, "term":term, "score":tf_idf})
    return result

tf_idf_output = tf_idf(number_of_docs, term_count, term_document_count)
tf_idf_output[:10]
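
A possible explanation for the letters, offered as a sketch rather than a confirmed diagnosis: `text` holds a path string, and `sc.parallelize(text)` distributes the characters of that string instead of the lines of the file. `sc.textFile(...)` is the usual way to get an RDD of lines (on Databricks the path would probably need the `dbfs:/` form, which is an assumption here). The plain-Python code below mirrors the pipeline without Spark so the logic can be checked on its own. The names `clean`, `tf_idf_per_word`, and `sample_lines` are hypothetical, and summing a word's per-document scores to get one entry per word is an assumed choice; in Spark that step would correspond to `reduceByKey(lambda a, b: a + b)` on (word, score) pairs.

```python
import math
import string
from collections import Counter, defaultdict

PUNCT = set(string.punctuation)

def clean(line):
    """Drop punctuation and lowercase, as in the question's last map step."""
    return "".join(ch.lower() for ch in line if ch not in PUNCT)

def tf_idf_per_word(lines):
    """Return one (word, score) pair per unique word, sorted by word.

    Assumption: scores of the same word in different documents are summed.
    """
    stripped = [line.strip() for line in lines]
    # Each non-empty line is treated as one document
    docs = [clean(line).split() for line in stripped if line]
    n_docs = len(docs)

    # Term count per (doc_id, term), like flatMapValues(...).countByValue()
    tf = Counter((doc_id, term) for doc_id, terms in enumerate(docs) for term in terms)
    # Number of documents containing each term, like distinct() then countByKey()
    df = Counter(term for terms in docs for term in set(terms))

    # Combine scores by word so that each word appears exactly once
    scores = defaultdict(float)
    for (doc_id, term), count in tf.items():
        scores[term] += count * math.log(n_docs / df[term])
    return sorted(scores.items())

# A str is iterable, so turning the path into a list yields single characters
path = "/dbfs/FileStore/tables/full_text.txt"
chars_as_docs = list(path)[:5]

# Hypothetical sample lines standing in for the contents of full_text.txt
sample_lines = ["The cat sat.", "", "The dog, the cat!", "A bird"]
result = tf_idf_per_word(sample_lines)
```

Here `chars_as_docs` shows why each "document" ends up being a single letter when the path string is parallelized, and `result` has the [(word, score), ...] shape asked for. If the maximum score per word is preferred over the sum, only the combining step needs to change.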

