Spelling similarity measure for cognate identification.
spsim: detailed project description
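SpSim is a Python 3 module that implements a spelling similarity measure for identifying cognates across languages. It takes into account the spelling differences that are characteristic of each language pair, as described in [Gomes2011].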
Note: in the examples below, $ denotes the bash prompt, and a Linux, macOS, or similar *nix environment is assumed.
Install as usual:
$ pip3 install spsim
Example command-line usage:
$ # first let's get some pairs of words that may be cognates:
$ wget http://research.variancia.com/spsim/maybe_enpt.txt
$ cat maybe_enpt.txt
pharmacy farmácia
arithmetic aritmética
$ # If we don't give any example cognates, SpSim will be equivalent to
$ # 1 - edit_distance / max_len_of_strings
$ # Note that by default spsim matches accented characters, i.e. a == á
$ echo "" > empty.txt
$ spsim empty.txt maybe_enpt.txt
pharmacy farmácia 0.5
arithmetic aritmética 0.8
$ # now let's get some example cognates:
$ wget http://research.variancia.com/spsim/examples_enpt.txt
$ cat examples_enpt.txt
alcohol álcool
alpha alfa
anomaly anomalia
mathematics matemática
methodology metodologia
metric métrica
morphine morfina
photos fotos
$ # by giving these examples to spsim, it will learn to ignore certain differences:
$ spsim examples_enpt.txt maybe_enpt.txt
pharmacy farmácia 1.0
arithmetic aritmética 1.0
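The baseline behaviour described above (no example cognates, so the score is 1 - edit_distance / max_len_of_strings, with accented characters matching their unaccented forms) can be illustrated with a minimal sketch. This is not spsim's own code or API: the function names `strip_accents`, `edit_distance` and `baseline_similarity` are hypothetical, plain Levenshtein distance is assumed as the edit distance, and accent matching is approximated by Unicode decomposition.

```python
import unicodedata


def strip_accents(word):
    """Drop combining marks so that, e.g., 'á' compares equal to 'a' (an assumption
    about how accent matching could be approximated, not spsim's implementation)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute, each costing 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def baseline_similarity(a, b):
    """1 - edit_distance / max_len_of_strings, computed on accent-stripped words."""
    a, b = strip_accents(a), strip_accents(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - edit_distance(a, b) / longest


for pair in [("pharmacy", "farmácia"), ("arithmetic", "aritmética")]:
    print(*pair, baseline_similarity(*pair))
```

Under these assumptions the sketch gives 0.5 and 0.8 for the two pairs, the same as the scores shown for the run with empty.txt. It does not model the learning step, in which spsim uses example cognates to ignore recurring differences such as ph/f.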
[Gomes2011] Luís Gomes and Gabriel Pereira Lopes. Measuring Spelling Similarity for Cognate Identification. In Progress in Artificial Intelligence, 15th Portuguese Conference in Artificial Intelligence, EPIA 2011, Lisboa, Portugal, October 2011. http://www.springerlink.com/content/gtl56j3l06906020/