文本分析软件

neolo的Python项目详细描述


新奥

Joshua Crowgey开发的Saulo Brand_o.文本分析软件 2014年夏天。

usage: neolo [-h] [--dicts DICT [DICT ...]] [--mltd] [--msttr] [--hdd]
             [--verbose] [--wordlen] [--wordtypes] [--hapax] [--punc-ratio]
             [--no-hyphen] [--no-apostrophe] [--sents [ABBREV]]
             [--stemming LANGUAGE]
             TEXT

Extract lexical statistics from a text file.

positional arguments:
  TEXT                  the text you want to investigate

optional arguments:
  -h, --help            show this help message and exit
  --dicts DICT [DICT ...]
                        a list of reference texts to compute neologism count
  --mltd                measure of lexical textual diversity
  --msttr               mean segmental type-token ratio
  --hdd                 HD-D probabilistic TTR
  --verbose, -v         increase the verbosty (can be repeated: -vvv)
  --wordlen, -w         print the distribution of words by length
  --wordtypes, -t       print the distribution of wordtypes (unigrams) by
                        count
  --hapax, -x           print the list of hapax legomena
  --punc-ratio, -p      print the ratio of punctuation tokens out of total
                        tokens
  --no-hyphen, -y       remove the hyphen (-) from the list of punctuation
                        symbols used in tokenization
  --no-apostrophe, -a   remove the apostrophe (') from the list of punctuation
                        symbols used in tokenization
  --sents [ABBREV], -s [ABBREV]
                        print sentence length statistics, uses an (optional)
                        abbreviations file containing stings which don't end
                        sentences (eg: Mr.). One abbreviaion per line, include
                        relevant punctuation. Note that items in the
                        abbreviations file will also be protected during later
                        tokenization.
  --stemming LANGUAGE, -m LANGUAGE
                        stem words using NLTK prior to processing them

新词计数

此程序的名称反映了此原始功能。新词 计数是通过引用已知的单词表或词典来计算的。词类型 在所考虑的文本中找到,但在参考文献中没有找到 词典/词表被认为是新词。

为了显示一个简单的示例,假设您有一个名为mary.txt的文本文件 其中包含以下传统诗歌:

Mary had a little lamb,
Her fleece was white as snow.
Everywhere that mary went,
the lamb was sure to go.

假设您使用的是GNU/Linux的Debian发行版,那么 英语单词存储在/usr/share/dict/words中,可以用作 参考资料。你可以让neolo检查mary.txt中的新词 --dicts选项。--dicts选项接受一个或多个文件名的列表 作为计算新词的参考。

user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
Opening texts/mary.txt with encoding:  utf-8 
Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
Counting and sorting words in text: texts/mary.txt ...done.
Opening /usr/share/dict/words with encoding:  utf-8 
Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
Neologism list:

Statistics:
-----------
Text size: 21 tokens in 18 types.
Number of hapax legomena: 15
TTR (type-token ratio): 0.8571428571428571
HTR (hapax-token ratio): 0.7142857142857143
HTyR (hapax-type ratio): 0.8333333333333334
Neologisms:  0 types not found in 1 dictionaries
Dictionaries contained 234937 tokens in 233615 types.

如您所见,mary.txt中没有不在引用中的单词 wordlist文件,所以neolo说“neologisms:0类型在1个字典中找不到”。

但是,如果您编辑mary.txt而不是fleece,那么这首诗的第二个 一行字写着“她的褶子像雪一样白”,现在neolo打印了一个新词列表。 以及它的常规输出。

user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
Opening texts/mary.txt with encoding:  utf-8 
Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
Counting and sorting words in text: texts/mary.txt ...done.
Opening /usr/share/dict/words with encoding:  utf-8 
Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
Neologism list:
pleece

Statistics:
-----------
Text size: 21 tokens in 18 types.
Number of hapax legomena: 15
TTR (type-token ratio): 0.8571428571428571
HTR (hapax-token ratio): 0.7142857142857143
HTyR (hapax-type ratio): 0.8333333333333334
Neologisms:  1 types not found in 1 dictionaries
Dictionaries contained 234937 tokens in 233615 types.

mltd

msttr

HD-D

欢迎加入QQ群-->: 979659372 Python中文网_新手群

推荐PyPI第三方库


热门话题
java对ServiceListener和ServiceTracker调用提供了哪些排序保证?   java找不到方法格式的符号(DateTimeFormatter)?   mysql有没有一种方法可以将TCPDump输出到一个文件中,并用Java对其进行过滤,每5秒钟用新数据覆盖一次该文件?   java如何最好地配置用户上传支持文件的上传位置   java我在Android上使用OData4j,我无法获取实体   JPA实体关系简单示例中的java获取错误   JAVANoClassDefFoundError:安卓。应用程序。用法安卓中的UsageStatsManager   Eclipse中javaoo代码分析   java MethodVisitor抛出类格式错误   java为什么在从ViewModel调用时,改型排队不起作用?   调试小程序Java控制台:删除跟踪消息大小限制   java复杂安卓活动动画   java如何在使用JDOM2解析XML时忽略注释内容   java通过循环创建文本字段   即使在bufferedwriter关闭后也未发现java文件异常   单链表恢复中的java错误