neolo Python package - PyPI

Text analysis software

Detailed description of the neolo Python project

neolo

Text analysis software developed by Joshua Crowgey for Saulo Brandão. Summer 2014.

usage: neolo [-h] [--dicts DICT [DICT ...]] [--mltd] [--msttr] [--hdd]
             [--verbose] [--wordlen] [--wordtypes] [--hapax] [--punc-ratio]
             [--no-hyphen] [--no-apostrophe] [--sents [ABBREV]]
             [--stemming LANGUAGE]
             TEXT

Extract lexical statistics from a text file.

positional arguments:
  TEXT                  the text you want to investigate

optional arguments:
  -h, --help            show this help message and exit
  --dicts DICT [DICT ...]
                        a list of reference texts to compute neologism count
  --mltd                measure of lexical textual diversity
  --msttr               mean segmental type-token ratio
  --hdd                 HD-D probabilistic TTR
  --verbose, -v         increase the verbosity (can be repeated: -vvv)
  --wordlen, -w         print the distribution of words by length
  --wordtypes, -t       print the distribution of wordtypes (unigrams) by
                        count
  --hapax, -x           print the list of hapax legomena
  --punc-ratio, -p      print the ratio of punctuation tokens out of total
                        tokens
  --no-hyphen, -y       remove the hyphen (-) from the list of punctuation
                        symbols used in tokenization
  --no-apostrophe, -a   remove the apostrophe (') from the list of punctuation
                        symbols used in tokenization
  --sents [ABBREV], -s [ABBREV]
                        print sentence length statistics, uses an (optional)
                        abbreviations file containing strings which don't end
                        sentences (eg: Mr.). One abbreviation per line, include
                        relevant punctuation. Note that items in the
                        abbreviations file will also be protected during later
                        tokenization.
  --stemming LANGUAGE, -m LANGUAGE
                        stem words using NLTK prior to processing them
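
To illustrate what the --no-hyphen and --no-apostrophe options describe, here is a minimal Python sketch of punctuation-based tokenization. It is not neolo's actual tokenizer: the function name tokenize and the parameters keep_hyphen and keep_apostrophe are hypothetical, and the sketch assumes punctuation simply acts as a separator and is discarded.

```python
import string


def tokenize(text, keep_hyphen=False, keep_apostrophe=False):
    """Split text into lowercase word tokens, treating punctuation as separators.

    keep_hyphen and keep_apostrophe are hypothetical stand-ins for the effect
    of --no-hyphen and --no-apostrophe: the symbol is removed from the
    punctuation set, so it stays inside words instead of splitting them.
    """
    punctuation = set(string.punctuation)
    if keep_hyphen:
        punctuation.discard("-")
    if keep_apostrophe:
        punctuation.discard("'")
    # Replace every remaining punctuation symbol with a space, then split.
    cleaned = "".join(" " if ch in punctuation else ch for ch in text)
    return cleaned.lower().split()


sample = "Don't over-think it."
print(tokenize(sample))
print(tokenize(sample, keep_hyphen=True))
print(tokenize(sample, keep_apostrophe=True))
```

With the hyphen removed from the punctuation set, over-think survives as a single token; with the apostrophe removed, don't does.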

Neologism count

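The name of this program reflects this original functionality. The neologism count is computed by reference to a known wordlist or dictionary. Word types which are found in the text under consideration but not in the reference dictionary/wordlist are considered neologisms.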
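
The definition above amounts to a set difference between the word types of the text and the word types of the reference. A minimal sketch, assuming both inputs are already tokenized and downcased; the function name find_neologisms and the tiny wordlist are hypothetical and not part of neolo:

```python
def find_neologisms(text_tokens, reference_tokens):
    """Return the sorted word types of the text that are absent from the reference."""
    return sorted(set(text_tokens) - set(reference_tokens))


# Hypothetical miniature reference wordlist.
reference = ["a", "day", "it", "the", "was"]

print(find_neologisms("it was a day".split(), reference))
print(find_neologisms("it was a frabjous day".split(), reference))
```

The first call finds nothing new; the second reports frabjous, the only type missing from the reference.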

To show a simple example, suppose you have a text file called mary.txt which contains the following traditional poem:

Mary had a little lamb,
Her fleece was white as snow.
Everywhere that mary went,
the lamb was sure to go.

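Assuming you are using the Debian distribution of GNU/Linux, English words are stored in /usr/share/dict/words, which can be used as a reference. You can have neolo check mary.txt for neologisms with the --dicts option. The --dicts option accepts a list of one or more filenames to use as references for computing neologisms.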

user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
Opening texts/mary.txt with encoding:  utf-8 
Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
Counting and sorting words in text: texts/mary.txt ...done.
Opening /usr/share/dict/words with encoding:  utf-8 
Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
Neologism list:

Statistics:
-----------
Text size: 21 tokens in 18 types.
Number of hapax legomena: 15
TTR (type-token ratio): 0.8571428571428571
HTR (hapax-token ratio): 0.7142857142857143
HTyR (hapax-type ratio): 0.8333333333333334
Neologisms:  0 types not found in 1 dictionaries
Dictionaries contained 234937 tokens in 233615 types.

As you can see, there are no words in mary.txt which aren't in the reference wordlist file, so neolo says "Neologisms:  0 types not found in 1 dictionaries".

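However, if you edit mary.txt to read pleece instead of fleece, so that the second line of the poem reads "Her pleece was white as snow", neolo now prints a neologism list as well as its regular output.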

user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
Opening texts/mary.txt with encoding:  utf-8 
Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
Counting and sorting words in text: texts/mary.txt ...done.
Opening /usr/share/dict/words with encoding:  utf-8 
Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
Neologism list:
pleece

Statistics:
-----------
Text size: 21 tokens in 18 types.
Number of hapax legomena: 15
TTR (type-token ratio): 0.8571428571428571
HTR (hapax-token ratio): 0.7142857142857143
HTyR (hapax-type ratio): 0.8333333333333334
Neologisms:  1 types not found in 1 dictionaries
Dictionaries contained 234937 tokens in 233615 types.
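
The token, type and hapax figures in the Statistics blocks above can be checked by hand. The following is a minimal sketch, not neolo's own code: it assumes words are runs of ASCII letters after downcasing, and the name lexical_stats is hypothetical.

```python
import re
from collections import Counter

POEM = """Mary had a little lamb,
Her fleece was white as snow.
Everywhere that mary went,
the lamb was sure to go."""


def lexical_stats(text):
    """Count tokens, types and hapax legomena, and derive the three ratios."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Hapax legomena are the types that occur exactly once.
    hapax = [word for word, n in counts.items() if n == 1]
    n_tokens, n_types, n_hapax = len(tokens), len(counts), len(hapax)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapax": n_hapax,
        "ttr": n_types / n_tokens,    # type-token ratio
        "htr": n_hapax / n_tokens,    # hapax-token ratio
        "htyr": n_hapax / n_types,    # hapax-type ratio
    }


print(lexical_stats(POEM))
```

For the poem this gives 21 tokens, 18 types and 15 hapax legomena, so the ratios are 18/21, 15/21 and 15/18. Swapping fleece for pleece leaves every count unchanged, which is why both runs report the same statistics.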

mltd

msttr
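
As a rough illustration of what --msttr reports: mean segmental type-token ratio is commonly computed by cutting the token stream into consecutive segments of equal size, taking the TTR of each full segment, and averaging the results. The sketch below assumes that common definition. The function name, the default segment size and the choice to drop a trailing partial segment are assumptions, not taken from neolo.

```python
def msttr(tokens, segment_size=100):
    """Mean TTR over consecutive full segments; a trailing partial segment is dropped."""
    ratios = []
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        ratios.append(len(set(segment)) / segment_size)
    if not ratios:
        raise ValueError("text is shorter than one segment")
    return sum(ratios) / len(ratios)


# Two full segments of four tokens each; the ninth token is ignored.
tokens = "a b a b c d c d e".split()
print(msttr(tokens, segment_size=4))
```

Because every segment has the same length, the result is less sensitive to overall text length than plain TTR, which tends to fall as a text grows.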

HD-D

neolo 0.1.2
