A library that returns the word frequency (ipm) for almost any Russian word
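Description
The Python library ruword_frequency
returns the frequency of Russian words (ipm, items per million), case-insensitively.
It is built from a large body of Russian documents and from prepared word-frequency sources. The full list: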
- Wikipedia dump, russian segment
- Flibusta dump, more than 200 GB of text
- Pyhlyi's library
- Новый частотный словарь русской лексики
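- Словарь русской литературы, from http://speakrus.ru/dict/index.htm
- Частотный словарь Марка фон Хагена (see description)
The ipm of a word is extracted from every source listed above, and the average is used. The full index contains 7 billion words, including (unfortunately) errors from the original data sources.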
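The averaging step described above can be sketched roughly as follows. This is only an illustration, not the library's actual code: the helper names `ipm` and `average_ipm`, the sample counts, and the choice of a plain arithmetic mean over the sources that contain the word are all assumptions.

```python
from statistics import mean


def ipm(count: int, total_tokens: int) -> float:
    """Items per million: occurrences of a word per one million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return count * 1_000_000 / total_tokens


def average_ipm(per_source_ipm: list[float]) -> float:
    """Arithmetic mean of per-source ipm values (assumed averaging rule)."""
    return mean(per_source_ipm) if per_source_ipm else 0.0


# Hypothetical data for one word: (occurrences, corpus size) in three made-up corpora.
sources = [(120, 2_000_000), (45, 1_000_000), (300, 6_000_000)]

values = [ipm(count, total) for count, total in sources]
combined = average_ipm(values)
print(values, combined)
```

Whether a source that lacks the word contributes 0.0 to the average or is skipped is not stated here, so the sketch leaves that decision to the caller.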
Requirements:
- Python 3
- The word index takes up almost 50 MB on disk and is downloaded the first time you call the frequency.load() method
Installation
# TODO
Usage
from ruword_frequency import Frequency
freq = Frequency()
freq.load()
freq.ipm('привет')
>>> 53.51823806762695
freq.ipm('неттакогослова')
>>> 0.0
# get the max ipm value, e.g. for weight normalization
freq.max_ipm()
>>> 42329.2890625
# iterate over the most used words, with ipm more than 10000
for w in freq.iterate_words(10000):
    print(w)
For other useful methods, see the marisa-trie documentation.
The trie index is available as freq.tree
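As a small illustration of the weight normalization mentioned next to max_ipm(), the sketch below scales an ipm value into the range [0, 1]. The helpers `linear_weight` and `log_weight` are hypothetical and not part of ruword_frequency; the two numbers are copied from the usage example so the snippet runs without downloading the index.

```python
import math


def linear_weight(ipm: float, max_ipm: float) -> float:
    """Scale ipm into [0, 1] by dividing by the largest ipm in the index."""
    if max_ipm <= 0:
        raise ValueError("max_ipm must be positive")
    return min(max(ipm / max_ipm, 0.0), 1.0)


def log_weight(ipm: float, max_ipm: float) -> float:
    """Log-scaled variant that compresses the gap between common and rare words."""
    if max_ipm <= 0:
        raise ValueError("max_ipm must be positive")
    return math.log1p(max(ipm, 0.0)) / math.log1p(max_ipm)


# Values taken from the usage example: freq.ipm('привет') and freq.max_ipm().
IPM_PRIVET = 53.51823806762695
MAX_IPM = 42329.2890625

w_lin = linear_weight(IPM_PRIVET, MAX_IPM)
w_log = log_weight(IPM_PRIVET, MAX_IPM)
print(w_lin, w_log)
```

Word frequencies are heavily skewed, so the linear weight of most words ends up very close to zero; the logarithmic variant is one common way to spread them out, though which scaling fits best depends on the application.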
Rebuilding the tree yourself
from ruword_frequency.source_reader import SourceReader
reader = SourceReader()
# increase socket timeout, sometimes helpful for huge file downloading:
import socket
socket.setdefaulttimeout(60)
reader.download_all_sources()
tree = reader.build_tree_from_dictionaries()
reader.save_tree(tree)
# use it
from ruword_frequency import Frequency
freq = Frequency()
freq.ipm('привет')