neolo
Text analysis software developed by Joshua Crowgey for Saulo Brandão. Summer 2014.
usage: neolo [-h] [--dicts DICT [DICT ...]] [--mltd] [--msttr] [--hdd]
             [--verbose] [--wordlen] [--wordtypes] [--hapax] [--punc-ratio]
             [--no-hyphen] [--no-apostrophe] [--sents [ABBREV]]
             [--stemming LANGUAGE]
             TEXT

Extract lexical statistics from a text file.

positional arguments:
  TEXT                  the text you want to investigate

optional arguments:
  -h, --help            show this help message and exit
  --dicts DICT [DICT ...]
                        a list of reference texts to compute neologism count
  --mltd                measure of lexical textual diversity
  --msttr               mean segmental type-token ratio
  --hdd                 HD-D probabilistic TTR
  --verbose, -v         increase the verbosty (can be repeated: -vvv)
  --wordlen, -w         print the distribution of words by length
  --wordtypes, -t       print the distribution of wordtypes (unigrams) by
                        count
  --hapax, -x           print the list of hapax legomena
  --punc-ratio, -p      print the ratio of punctuation tokens out of total
                        tokens
  --no-hyphen, -y       remove the hyphen (-) from the list of punctuation
                        symbols used in tokenization
  --no-apostrophe, -a   remove the apostrophe (') from the list of punctuation
                        symbols used in tokenization
  --sents [ABBREV], -s [ABBREV]
                        print sentence length statistics, uses an (optional)
                        abbreviations file containing stings which don't end
                        sentences (eg: Mr.). One abbreviaion per line, include
                        relevant punctuation. Note that items in the
                        abbreviations file will also be protected during later
                        tokenization.
  --stemming LANGUAGE, -m LANGUAGE
                        stem words using NLTK prior to processing them
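Neologism counting
The program's name reflects its original purpose: counting neologisms. The count is computed against a known wordlist or dictionary. Any word type that is found in the text under consideration but not in the reference dictionary or wordlist is treated as a neologism.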
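The definition above amounts to a set difference between the word types of the text and the word types of the reference. The following is a minimal Python sketch of that idea, not neolo's actual code: the function names `word_types` and `neologisms` and the simple regex tokenizer are assumptions made for illustration.

```python
import re


def word_types(text):
    """Return the set of downcased word types in `text`.

    Assumption: a token is a run of ASCII letters. neolo's real
    tokenizer (punctuation list, optional stemming) is more elaborate.
    """
    return set(re.findall(r"[a-z]+", text.lower()))


def neologisms(text, reference):
    """Return the sorted word types found in `text` but not in `reference`."""
    return sorted(word_types(text) - word_types(reference))


# A tiny in-memory reference stands in for a real wordlist file.
reference = "her fleece was white as snow"
print(neologisms("Her pleece was white as snow", reference))  # ['pleece']
```

Working on sets of types rather than lists of tokens means a made-up word is reported once, however often it occurs in the text.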
As a simple example, suppose you have a text file named mary.txt which contains the following traditional poem:
Mary had a little lamb,
Her fleece was white as snow.
Everywhere that mary went,
the lamb was sure to go.
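If you are using a Debian distribution of GNU/Linux, a list of English words is stored in /usr/share/dict/words and can be used as the reference. You can have neolo check mary.txt for neologisms with the --dicts option, which accepts a list of one or more filenames to use as references when computing neologisms.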
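Because --dicts takes one or more files, the reference vocabulary can be thought of as the union of the word types in all of them. A hedged Python sketch of that idea follows. `load_reference` is a hypothetical helper, and the letters-only regex tokenizer and UTF-8 default are assumptions, not taken from neolo's source.

```python
import re
from pathlib import Path


def load_reference(paths, encoding="utf-8"):
    """Return the union of downcased word types from one or more files.

    Hypothetical helper: each file is tokenized with a simple
    letters-only regex, so an entry such as "Mary's" yields the
    types "mary" and "s".
    """
    types = set()
    for path in paths:
        text = Path(path).read_text(encoding=encoding)
        types.update(re.findall(r"[a-z]+", text.lower()))
    return types
```

A call such as `load_reference(["/usr/share/dict/words"])` would then give a set against which the word types of mary.txt can be compared. Passing several paths simply enlarges the set.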
user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
Opening texts/mary.txt with encoding: utf-8
Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
Counting and sorting words in text: texts/mary.txt ...done.
Opening /usr/share/dict/words with encoding: utf-8
Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
Neologism list:
Statistics:
-----------
Text size: 21 tokens in 18 types.
Number of hapax legomena: 15
TTR (type-token ratio): 0.8571428571428571
HTR (hapax-token ratio): 0.7142857142857143
HTyR (hapax-type ratio): 0.8333333333333334
Neologisms: 0 types not found in 1 dictionaries
Dictionaries contained 234937 tokens in 233615 types.
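As you can see, mary.txt contains no words that are missing from the reference wordlist file, so neolo reports "Neologisms: 0 types not found in 1 dictionaries".
However, if you edit mary.txt to say pleece instead of fleece, so that the second line of the poem reads "Her pleece was white as snow", neolo prints a list of neologisms along with its regular output.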
user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
Opening texts/mary.txt with encoding: utf-8
Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
Counting and sorting words in text: texts/mary.txt ...done.
Opening /usr/share/dict/words with encoding: utf-8
Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
Neologism list:
pleece
Statistics:
-----------
Text size: 21 tokens in 18 types.
Number of hapax legomena: 15
TTR (type-token ratio): 0.8571428571428571
HTR (hapax-token ratio): 0.7142857142857143
HTyR (hapax-type ratio): 0.8333333333333334
Neologisms: 1 types not found in 1 dictionaries
Dictionaries contained 234937 tokens in 233615 types.
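The figures in the Statistics block can be checked by hand: the poem has 21 tokens in 18 types, 15 of which occur exactly once. The Python sketch below recomputes the three ratios from those counts. It is an illustration under assumptions, not neolo's implementation: `lexical_stats` is a hypothetical name, and the tokenizer simply downcases the text and keeps runs of ASCII letters.

```python
import re
from collections import Counter

POEM = """Mary had a little lamb,
Her fleece was white as snow.
Everywhere that mary went,
the lamb was sure to go."""


def lexical_stats(text):
    """Compute token/type/hapax counts and their ratios for `text`.

    Assumes `text` contains at least one word; an empty text would
    cause a division by zero.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Hapax legomena are the types that occur exactly once.
    n_hapax = sum(1 for n in counts.values() if n == 1)
    n_tokens, n_types = len(tokens), len(counts)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapax": n_hapax,
        "TTR": n_types / n_tokens,   # type-token ratio
        "HTR": n_hapax / n_tokens,   # hapax-token ratio
        "HTyR": n_hapax / n_types,   # hapax-type ratio
    }


stats = lexical_stats(POEM)
for name, value in stats.items():
    print(f"{name}: {value}")
```

For this poem the sketch gives TTR = 18/21, HTR = 15/21 and HTyR = 15/18, in line with the values neolo prints. Replacing fleece with pleece swaps one hapax for another, which is why both runs show identical statistics.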