Sorting words by frequency of use

Asked 2025-04-17 04:18 by 数据工程师

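I have a list of about ten thousand English words, and I would like to sort them by how often they are used in literature, newspapers, blogs, and so on. Can I do this in Python or some other language? I have heard of NLTK, which is the closest thing I know of to a library that could help. Or does this task call for a different tool?

Thank you

Tags: text processing, data sorting, text analysis, word frequency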

3 Answers

You can use collections.Counter, which keeps the code short:

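import collections

words = get_iterable_or_list_of_words()  # placeholder: supplying the words is up to you
counts = collections.Counter(words)
print(counts.most_common())  # (word, count) pairs, most frequent first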
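Counter only counts the words it is given, so to rank a word list by general usage it has to be fed text from some reference source. Below is a minimal sketch of that idea, assuming the reference text is available as a plain string. The names count_words, rank_by_frequency, and reference_text are hypothetical, and the sample sentence is only a stand-in for a real corpus.

```python
import collections
import re

def count_words(text):
    """Count lowercase word occurrences in a piece of reference text."""
    return collections.Counter(re.findall(r"[a-z']+", text.lower()))

def rank_by_frequency(words, counts):
    """Sort words from most to least frequent according to counts."""
    # A Counter returns 0 for missing keys, so unseen words sort last.
    return sorted(words, key=lambda w: counts[w.lower()], reverse=True)

# Hypothetical stand-in for a real corpus of books, news, or blog text.
reference_text = "The cat sat on the mat. The dog saw the cat."
counts = count_words(reference_text)
print(counts.most_common(2))
print(rank_by_frequency(["dog", "cat", "the", "zebra"], counts))
```

With this sample text, "the" ranks first and "zebra", which never appears, ranks last.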

Answered 2025-04-17 by Python大师

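I don't know much about natural language processing, but I think Python is a good language for this kind of work.

I searched Google for "Python natural language" and found this link:

http://www.nltk.org/

Searching StackOverflow turned up this answer:

Python or Java for text processing (text mining, information retrieval, natural language processing)

That answer also mentions Pattern:

http://www.clips.ua.ac.be/pages/pattern

Pattern is worth a look; it seems promising.

Good luck and have fun!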

Answered 2025-04-17 by Python大师

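Python and NLTK are a good fit for sorting a word list, because NLTK ships with several English corpora from which you can extract word frequency information.

The following code prints the given wordlist ordered by each word's frequency in the Brown corpus: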

import nltk
from nltk.corpus import brown

# The Brown corpus must be available locally; run nltk.download("brown") once if it is not.
wordlist = ["corpus", "house", "the", "Peter", "asdf"]
# Collect frequency information from the Brown corpus; this might take a few seconds.
freqs = nltk.FreqDist(w.lower() for w in brown.words())
# Sort wordlist by word frequency, most frequent first.
wordlist_sorted = sorted(wordlist, key=lambda x: freqs[x.lower()], reverse=True)
# Print the sorted list.
for w in wordlist_sorted:
    print(w)

Output:

>>> 
the
house
Peter
corpus
asdf

If you want to use a different corpus or need more information, have a look at chapter 2 of the NLTK book.
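The sorting step does not depend on NLTK itself: any mapping from word to count will do. Here is a minimal sketch of that step in isolation, assuming a plain dict in place of the FreqDist. The counts in example_freqs are made up for illustration and are not real Brown corpus figures, and sort_by_frequency is a hypothetical helper. Unlike the code above, which leaves tied words in their input order, this version breaks ties alphabetically.

```python
# Made-up counts for illustration only; not real Brown corpus figures.
example_freqs = {"the": 1000, "house": 50, "peter": 5, "corpus": 5}

def sort_by_frequency(wordlist, freqs):
    """Most frequent first; ties and unseen words fall back to alphabetical order."""
    def key(word):
        lowered = word.lower()
        # Negate the count so that larger counts sort earlier.
        return (-freqs.get(lowered, 0), lowered)
    return sorted(wordlist, key=key)

wordlist = ["corpus", "house", "the", "Peter", "asdf"]
for w in sort_by_frequency(wordlist, example_freqs):
    print(w)
```

An explicit tie-break keeps the result reproducible when many words in a ten-thousand-word list share the same count or do not occur in the corpus at all.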

Answered 2025-04-17 by Python大师
