Word bigrams and ranks

1 vote

2 answers

3174 views

Asked 2025-04-17 10:38

I am using this code to get the frequencies of bigrams:

from collections import defaultdict

import nltk

text1 = 'the cat jumped over the dog in the dog house'
text = text1.split()

# Count how often each adjacent word pair (bigram) occurs.
counts = defaultdict(int)
for pair in nltk.bigrams(text):
    counts[pair] += 1

for pair, c in counts.items():
    print(pair, c)

The output is (the order of the lines may vary):

('the', 'cat') 1
('dog', 'in') 1
('cat', 'jumped') 1
('jumped', 'over') 1
('in', 'the') 1
('over', 'the') 1
('dog', 'house') 1
('the', 'dog') 2

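What I want is to list the bigrams, but showing the rank of each word instead of the word itself. By "rank" I mean that the most frequent word has rank 1, the second most frequent has rank 2, and so on. In this example the ranks are: 1. the, 2. dog; words with equal frequency are then ranked in order of first appearance: 3. cat, 4. jumped, 5. over, and so on.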

For example:

1 3 1

instead of:

('the', 'cat') 1

I think that to do this I need a dictionary mapping each word to its rank, but I am stuck and don't know how to continue. This is what I have so far:

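from nltk import FreqDist

fd = FreqDist()
ranks = []
rank = 0
for word in text:
    fd[word] += 1  # replaces fd.inc(word), which NLTK 3 no longer provides
for rank, word in enumerate(fd):
    ranks.append(rank+1)

word_rank = {}
for word in text:
    word_rank[word] = ranks  # stuck here: this stores the whole list, not the word's own rank

print(ranks)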


2 Answers

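The code below builds the ranking in steps. It first records the position at which each word first appears, then counts word frequencies with Counter. Each word is mapped to a tuple (negative frequency, first position), and sorting those tuples puts the most frequent words first, with ties broken by order of appearance. The position in that sorted list is the word's final rank, which fix() uses to turn a bigram and its count into a triple of two ranks and the count.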

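from collections import Counter

text1 = 'the cat jumped over the dog in the dog house'.split(' ')

# Position (1-based) at which each word first appears.
word_to_rank = {}
for i, word in enumerate(text1):
    if word not in word_to_rank:
        word_to_rank[word] = i+1

word_to_frequency = Counter(text1)

# Sort key per word: higher frequency first, then earlier first appearance.
word_to_tuple = {}
for word in word_to_rank:
    word_to_tuple[word] = (-word_to_frequency[word], word_to_rank[word])

# First positions are unique, so the tuples are unique and the map can be inverted.
tuple_to_word = dict(zip(word_to_tuple.values(), word_to_tuple.keys()))

sorted_by_conditions = sorted(tuple_to_word.keys())

# Final rank = position in the sorted order.
word_to_true_rank = {}
for i, _tuple in enumerate(sorted_by_conditions):
    word_to_true_rank[tuple_to_word[_tuple]] = i+1

def fix(pair, c):
    """Replace both words of a bigram by their ranks, keeping the count."""
    return word_to_true_rank[pair[0]], word_to_true_rank[pair[1]], c

pair = ('the', 'cat')
c = 1
print(fix(pair, c))

pair = ('the', 'dog')
c = 2
print(fix(pair, c))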


>>>
(1, 3, 1)
(1, 2, 2)
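The same idea can be condensed into a single sort. The following is only a sketch for Python 3: the helper name rank_words is made up for illustration, and breaking ties by first appearance is an assumption taken from the example in the question.

```python
from collections import Counter

def rank_words(tokens):
    """Map each distinct word to its rank (1 = most frequent).

    Ties are broken by first appearance (an assumption based on
    the example ranks given in the question).
    """
    freq = Counter(tokens)
    # Remember where each word first occurs, for tie-breaking.
    first_pos = {}
    for i, word in enumerate(tokens):
        first_pos.setdefault(word, i)
    # Most frequent first, then earliest first appearance.
    ordered = sorted(freq, key=lambda w: (-freq[w], first_pos[w]))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

tokens = 'the cat jumped over the dog in the dog house'.split()
ranks = rank_words(tokens)
print(ranks['the'], ranks['dog'], ranks['cat'])  # 1 2 3
```

Looking up the first position in a dictionary avoids rescanning the token list for every word, which may matter on longer texts.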

Answered 2025-04-17 by Python大师


Assuming counts has already been created, the following code should give you the result you want:

from collections import defaultdict

# Word frequencies.
freq = defaultdict(int)
for word in text:
    freq[word] += 1

# Most frequent first; ties broken by first position in the text.
ranks = sorted(freq.keys(), key=lambda k: (-freq[k], text.index(k)))
ranks = dict(zip(ranks, range(1, len(ranks)+1)))

for (a, b), count in counts.items():
    print(ranks[a], ranks[b], count)

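Output (for the bigrams in the order listed in the question):

1 3 1
2 6 1
3 4 1
4 5 1
6 1 1
5 1 1
2 7 1
1 2 2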

Here are some intermediate values that may help you understand how it works:

>>> dict(freq)
{'house': 1, 'jumped': 1, 'over': 1, 'dog': 2, 'cat': 1, 'in': 1, 'the': 3}
>>> sorted(freq.keys(), key=lambda k: (-freq[k], text.index(k)))
['the', 'dog', 'cat', 'jumped', 'over', 'in', 'house']
>>> dict(zip(ranks, range(1, len(ranks)+1)))
{'house': 7, 'jumped': 4, 'over': 5, 'dog': 2, 'cat': 3, 'in': 6, 'the': 1}
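For reference, here is a self-contained sketch of the whole pipeline that runs without NLTK. It assumes Python 3.7 or later, where Counter.most_common() lists words with equal counts in the order they were first seen; the function name ranked_bigrams is hypothetical, and zip(tokens, tokens[1:]) stands in for nltk.bigrams.

```python
from collections import Counter

def ranked_bigrams(tokens):
    """Return (rank_of_first, rank_of_second, count) for each distinct bigram."""
    # most_common() sorts by count, keeping first-seen order among ties
    # (Python 3.7+), so enumerating it yields the ranks directly.
    ranks = {w: r for r, (w, _) in enumerate(Counter(tokens).most_common(), start=1)}
    # Pair each word with its successor; this replaces nltk.bigrams here.
    counts = Counter(zip(tokens, tokens[1:]))
    return [(ranks[a], ranks[b], c) for (a, b), c in counts.items()]

tokens = 'the cat jumped over the dog in the dog house'.split()
for a, b, c in ranked_bigrams(tokens):
    print(a, b, c)
```

The triples come out in the order in which each bigram first occurs in the text, so the line order can differ from the listings above even though the values agree.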

Answered 2025-04-17 by Python大师

