How do I count the frequency of words in a dictionary?

Posted 2024-06-16 11:31:58


I have a list of dictionaries, like this:

[{'mississippi': 1, 'worth': 1, 'reading': 1}, {'commonplace': 1, 'river': 1, 'contrary': 1, 'ways': 1, 'remarkable': 1}, {'considering': 1, 'missouri': 1, 'main': 1, 'branch': 1, 'longest': 1, 'river': 1, 'world--four': 1}, {'seems': 1, 'safe': 1, 'crookedest': 1, 'river': 1, 'part': 1, 'journey': 1, 'uses': 1, 'cover': 1, 'ground': 1, 'crow': 1, 'fly': 1, 'six': 1, 'seventy-five': 1}, {'discharges': 1, 'water': 1, 'st': 1}, {'lawrence': 1, 'twenty-five': 1, 'rhine': 1, 'thirty-eight': 1, 'thames': 1}, {'river': 1, 'vast': 1, 'drainage-basin:': 1, 'draws': 1, 'water': 1, 'supply': 1, 'twenty-eight': 1, 'states': 1, 'territories': 1, 'delaware': 1, 'atlantic': 1, 'seaboard': 1, 'country': 1, 'idaho': 1, 'pacific': 1, 'slope--a': 1, 'spread': 1, 'forty-five': 1, 'degrees': 1, 'longitude': 1}, {'mississippi': 1, 'receives': 1, 'carries': 1, 'gulf': 1, 'water': 1, 'fifty-four': 1, 'subordinate': 1, 'rivers': 1, 'navigable': 1, 'steamboats': 1, 'hundreds': 1, 'flats': 1, 'keels': 1}, {'area': 1, 'drainage-basin': 1, 'combined': 1, 'areas': 1, 'england': 1, 'wales': 1, 'scotland': 1, 'ireland': 1, 'france': 1, 'spain': 1, 'portugal': 1, 'germany': 1, 'austria': 1, 'italy': 1, 'turkey': 1, 'almost': 1, 'wide': 1, 'region': 1, 'fertile': 1, 'mississippi': 1, 'valley': 1, 'proper': 1, 'exceptionally': 1}]

I want to change it into my desired output, shown below, in order to compute a similarity score between two target words:

mississippi: 3
    worth: 1
    reading: 1
...

The first line is a target word together with its frequency across the whole corpus. Below it are the words that appear in the same sentence as the target word, with their frequencies. Taking the first dictionary, the entry for "mississippi" would reference "worth" and "reading", whose frequency within the sentence is 1, while the frequency of "mississippi" across the whole corpus is 3. I would like to sort the target words by frequency in descending order. Can anyone help?
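
For reference, the transformation being asked for could be sketched as follows, assuming the list of per-sentence dictionaries is stored in a variable named sentences (the variable name and the truncated two-sentence sample are illustrative assumptions, not part of the original post):

from collections import Counter, defaultdict

# assumption: `sentences` holds the list of per-sentence dicts shown above
sentences = [{'mississippi': 1, 'worth': 1, 'reading': 1},
             {'mississippi': 1, 'valley': 1, 'proper': 1, 'exceptionally': 1}]

totals = Counter()                 # corpus-wide frequency of every word
associated = defaultdict(Counter)  # per target word: words sharing a sentence with it

for sentence in sentences:
    totals.update(sentence)        # Counter.update() adds the per-sentence counts
    for word in sentence:
        for other, count in sentence.items():
            if other != word:
                associated[word][other] += count

for word, total in totals.most_common():   # descending order of corpus frequency
    print("{}: {}".format(word, total))
    for other, count in associated[word].items():
        print("\t{}: {}".format(other, count))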


Tags: target, dictionary, frequency, word-frequency, four, five, reading, basin
2 Answers

Hopefully the code below works the way you need it to:

word_counts = {}

with open('sample.txt', 'r') as f:   # read the sample text
    data = f.read().lower()
with open('common.txt', 'r') as f:   # read the common words to be ignored
    common_words = f.read().lower().split()

for char in ',;\n':                  # replace the separators with spaces
    data = data.replace(char, ' ')

for word in data.split():
    if word not in common_words:     # only count words not in common.txt
        # this line helps to count the appearances
        word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
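
As a side note, the same counting can be done more compactly with collections.Counter; this is a sketch under the same file-name assumptions (sample.txt and common.txt), not part of the original answer:

from collections import Counter

with open('sample.txt', 'r') as f:
    data = f.read().lower()
with open('common.txt', 'r') as f:
    common_words = set(f.read().lower().split())

for char in ',;\n':                  # replace the separators with spaces
    data = data.replace(char, ' ')

# Counter builds the word -> count mapping in a single pass
word_counts = Counter(w for w in data.split() if w not in common_words)
print(word_counts)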

It's not entirely clear, either from your desired output or from your code, what exactly you are trying to achieve, but if it's just about counting words in individual sentences, the strategy should be:

  1. Read your common.txt into a set for fast lookups.
  2. Read your sample.txt and split it on . to get the individual sentences.
  3. Clear out all the non-word characters (you have to define them, or use the regex \b to catch word boundaries; see the regex sketch after this list) and replace them with spaces.
  4. Split on whitespace and count the words that are not present in the set from step 1.
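
For the regex route mentioned in step 3, a minimal sketch (the exact pattern is an assumption and would need tuning for how you want apostrophes and hyphens treated):

import re

sentence = "It is not a commonplace river, but on the contrary is in all ways remarkable."
# \w-based pattern keeps letters/digits/underscores plus in-word hyphens and apostrophes
words = re.findall(r"\b\w[\w'-]*\b", sentence.lower())
print(words)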

So:

import collections

with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set

interpunction = ";,'\""  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))

sentences_counter = []  # a list to hold a word count for each sentence
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    # read the whole file to include linebreaks and split on `.` to get individual sentences
    sentences = [s for s in f.read().split(".") if s.strip()]  # ignore empty sentences
    for sentence in sentences:  # iterate over each sentence
        sentence = sentence.translate(trans_table)  # replace the interpunction with spaces
        word_counter = collections.defaultdict(int)  # a string:int default dict for counting
        for word in sentence.split():  # split the sentence and iterate over the words
            if word.lower() not in common_words:  # count only words not in the common.txt
                word_counter[word.lower()] += 1
        sentences_counter.append(word_counter)  # add the current sentence word count

NOTE: on Python 2.x, use string.maketrans() instead of str.maketrans().
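
For completeness, building the translation table on Python 2 would look like this (a sketch, assuming a Python 2.x interpreter):

# Python 2.x only: maketrans lives in the string module
import string
trans_table = string.maketrans(interpunction, " " * len(interpunction))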

This will produce sentences_counter, holding a dict count for each sentence in your sample.txt, where the keys are the actual words and their associated values are the word counts. You can print the result as:

for i, v in enumerate(sentences_counter):
    print("Sentence #{}:".format(i+1))
    print("\n".join("\t{}: {}".format(w, c) for w, c in v.items()))

Which will produce (for your sample data):

Sentence #1:
    area: 1
    drainage-basin: 1
    great: 1
    combined: 1
    areas: 1
    england: 1
    wales: 1
    wide: 1
    region: 1
    fertile: 1
Sentence #2:
    mississippi: 1
    valley: 1
    proper: 1
    exceptionally: 1

Keep in mind that the (English) language is more complex than this; for example, "A cat wags its tail when it's angry, so keep away from it." will turn out very differently depending on how you treat the apostrophe. Also, a dot doesn't necessarily denote the end of a sentence. If you want to do serious language analysis, you should look into a dedicated NLP library.
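
As an illustration of what such a library buys you, here is a sketch using NLTK's tokenizers (NLTK as the choice of library is my assumption; it requires the nltk package and its punkt tokenizer data):

import nltk  # assumes: pip install nltk; nltk.download('punkt')

text = "A cat wags its tail when it's angry, so keep away from it. Keep that in mind."
for sentence in nltk.sent_tokenize(text):  # proper sentence segmentation
    print(nltk.word_tokenize(sentence))    # splits apostrophes: ["it", "'s", ...]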

UPDATE: While I can't see the usefulness of repeating the data for every word (the in-sentence counts never change), if you want to print each word with all the other counts nested beneath it, you can add an inner loop when printing:

for i, v in enumerate(sentences_counter):
    print("Sentence #{}:".format(i+1))
    for word, count in v.items():
        print("\t{} {}".format(word, count))
        print("\n".join("\t\t{}: {}".format(w, c) for w, c in v.items() if w != word))

Which will give you:

Sentence #1:
    area 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    drainage-basin 1
        area: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    great 1
        area: 1
        drainage-basin: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    combined 1
        area: 1
        drainage-basin: 1
        great: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    areas 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    england 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    wales 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wide: 1
        region: 1
        fertile: 1
    wide 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        region: 1
        fertile: 1
    region 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        fertile: 1
    fertile 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
Sentence #2:
    mississippi 1
        valley: 1
        proper: 1
        exceptionally: 1
    valley 1
        mississippi: 1
        proper: 1
        exceptionally: 1
    proper 1
        mississippi: 1
        valley: 1
        exceptionally: 1
    exceptionally 1
        mississippi: 1
        valley: 1
        proper: 1

Feel free to remove the printed sentence numbers and to drop one level of tab indentation to get closer to the output you wanted in your question. You can also build a tree-like dictionary instead of printing everything to STDOUT, if you prefer.

UPDATE 2: You don't have to use a set for common_words if you don't want to. In this case it's pretty much interchangeable with a list, so you could use a list comprehension instead of a set comprehension (i.e. replace the curly braces with square brackets), but a list lookup is an O(n) operation whereas a set lookup is O(1), so a set is preferred here. Not to mention the side benefit of automatic deduplication should common.txt contain duplicate words.
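
To make the set-vs-list difference concrete, a small sketch (the comparison itself is an addition, not from the original answer):

with open("common.txt", "r") as f:
    common_list = [l.strip().lower() for l in f]  # list comprehension: square brackets
with open("common.txt", "r") as f:
    common_set = {l.strip().lower() for l in f}   # set comprehension: curly braces

print("the" in common_list)  # O(n): scans the list element by element
print("the" in common_set)   # O(1): a single hash lookup; duplicates already removed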

As for collections.defaultdict, it's only there to save some coding/checking — it automatically initializes a dictionary key whenever one is requested. Without it, you'd have to do that manually:

with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set

interpunction = ";,'\""  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))

sentences_counter = []  # a list to hold a word count for each sentence
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    # read the whole file to include linebreaks and split on `.` to get individual sentences
    sentences = [s for s in f.read().split(".") if s.strip()]  # ignore empty sentences
    for sentence in sentences:  # iterate over each sentence
        sentence = sentence.translate(trans_table)  # replace the interpunction with spaces
        word_counter = {}  # initialize a word counting dictionary
        for word in sentence.split():  # split the sentence and iterate over the words
            word = word.lower()  # turn the word to lowercase
            if word not in common_words:  # count only words not in the common.txt
                word_counter[word] = word_counter.get(word, 0) + 1  # increase the last count
        sentences_counter.append(word_counter)  # add the current sentence word count

UPDATE 3: If you just want one raw word tally over all the sentences, as in your latest question update, you don't even need to consider the sentences themselves — just add a dot to the interpunction list, read the file line by line, split on whitespace, and count the words as before:

import collections

with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set

interpunction = ";,'\"."  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))

word_counter = collections.defaultdict(int)  # a string:int default dict for counting
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    for line in f:  # read the file line by line
        for word in line.translate(trans_table).split():  # remove interpunction and split
            if word.lower() not in common_words:  # count only words not in the common.txt
                word_counter[word.lower()] += 1  # increase the count

print("\n".join("{}: {}".format(w, c) for w, c in word_counter.items()))  # print the counts
