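Python textdescriptives package - PyPI

A package for calculating a variety of features from text

Project description for the textdescriptives Python package

textdescriptives

A Python package for calculating a variety of statistics from text.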

Installation

Clone the GitHub repository, navigate to it in a terminal, and run pip install .

Usage

To calculate all possible metrics:

import textdescriptives

# Input can be either a string, a list of strings, or a pandas Series
en_test = ['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
           'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.']

# Placeholder value: folder holding the stanfordnlp models (see Dependency Distance)
snlp_path = 'path/to/snlp_resources'

textdescriptives.all_metrics(en_test, lang = 'en', snlp_path = snlp_path)

	Text	avg_word_length	median_word_length	std_word_length	avg_sentence_length	median_sentence_length	std_sentence_length	avg_syl_per_word	median_syl_per_word	std_syl_per_word	type_token_ratio	lix	rix	n_types	n_sentences	n_tokens	n_chars	gunning_fog	smog	flesch_reading_ease	flesch_kincaid_grade	automated_readability_index	coleman_liau_index	Germanic	Latinate	Latinate/Germanic	mean_dependency_distance	std_dependency_distance	mean_prop_adjacent_dependency_relation	std_prop_adjacent_dependency_relation
0	The world is changed.(...)	3.28571	3	1.54127	7	6	3.09839	1.08571	1	0.368117	0.657143	12.7143	0.4	24	5	35	121	3.94286	5.68392	107.879	-0.0485714	-2.45429	-0.708571	75	25	0.333333	1.60381	0.36493	0.695238	0.0481871
1	He felt that his whole (...)	4.16667	4	1.97203	24	24	0	1.16667	1	0.471405	0.833333	40.6667	4	21	1	24	101	11.2667	0	83.775	7.53667	10.195	7.46667	83.3333	16.6667	0.2	2.16	0	0.64	0
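
The comment in the example above says the input can be a string, a list of strings, or a pandas Series. As a rough illustration of what accepting all three involves, the sketch below normalizes such inputs to a plain list of strings. The helper name to_text_list is hypothetical and is not part of textdescriptives; it duck-types on .tolist() so that it does not need pandas installed.

```python
def to_text_list(texts):
    """Normalize a str, a list of str, or a Series-like object to a list of str.

    Hypothetical helper for illustration only; not textdescriptives' own code.
    """
    if isinstance(texts, str):
        # A single document becomes a one-element list
        return [texts]
    if hasattr(texts, "tolist"):
        # pandas Series (and NumPy arrays) expose .tolist()
        texts = texts.tolist()
    texts = list(texts)
    if not all(isinstance(t, str) for t in texts):
        raise TypeError("all texts must be strings")
    return texts


if __name__ == "__main__":
    print(to_text_list("A single document."))
    print(to_text_list(["First document.", "Second document."]))
```

Normalizing once at the entry point keeps the per-metric code free of type checks.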

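Each category of metrics can also be calculated on its own by calling that category's function, for example textdescriptives.basic_stats.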

textdescriptives works for most languages; simply change the country code passed as lang:

import pandas as pd

da_test = pd.Series(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
            "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])

textdescriptives.all_metrics(da_test, lang = 'da', snlp_path=snlp_path)

If you only want a subset of the basic statistics:

textdescriptives.basic_stats(en_test, lang = 'en', metrics=['avg_word_length', 'n_chars'])

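	Text	avg_word_length	n_chars
0	The world is changed.(...)	3.28571	121
1	He felt that his whole (...)	4.16667	101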

Readability

The readability measures are largely derived from the textstat library and are thoroughly defined there.

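Etymology

The etymology measures are calculated using macroetym, only slightly rewritten so that it can be called from a script. They are included because, in English, a higher frequency of words of Latinate origin tends to indicate a more formal register.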

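Dependency Distance

Mean dependency distance can be used as a measure of the average syntactic complexity of a text. It requires the snlp library: the dependency distance function needs stanfordnlp and its language models. If you have already downloaded these models, the path to the folder can be specified in the snlp_path parameter. Otherwise, the models will be downloaded to your working directory + /snlp_resources.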
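
To make the two dependency measures concrete, here is a minimal sketch that computes them from head indices that have already been parsed. It assumes the CoNLL-U convention (1-based token positions, head 0 for the root) and skips the root when averaging; whether textdescriptives counts the root or punctuation the same way is not stated here, so its numbers may differ. The function names are hypothetical, and the population standard deviation is a choice made for this sketch.

```python
from statistics import mean, pstdev


def sentence_dependency_stats(heads):
    """Return (mean dependency distance, proportion of adjacent relations).

    heads[i] is the 1-based position of the head of token i + 1;
    0 marks the root, which is skipped (an assumption of this sketch).
    """
    distances = [abs(position - head)
                 for position, head in enumerate(heads, start=1)
                 if head != 0]
    if not distances:
        return 0.0, 0.0
    adjacent = sum(1 for d in distances if d == 1) / len(distances)
    return mean(distances), adjacent


def text_dependency_stats(sentences):
    """Aggregate sentence-level values into a mean and standard deviation."""
    per_sentence = [sentence_dependency_stats(heads) for heads in sentences]
    distances = [d for d, _ in per_sentence]
    proportions = [p for _, p in per_sentence]
    return {
        "mean_dependency_distance": mean(distances),
        "std_dependency_distance": pstdev(distances),
        "mean_prop_adjacent_dependency_relation": mean(proportions),
        "std_prop_adjacent_dependency_relation": pstdev(proportions),
    }


if __name__ == "__main__":
    # "The dog barks": The -> dog, dog -> barks, barks = root
    # "I saw the big dog": I -> saw, saw = root, the -> dog, big -> dog, dog -> saw
    print(text_dependency_stats([[2, 3, 0], [2, 0, 5, 5, 2]]))
```

A text with a single sentence gets a standard deviation of 0 under this choice, since there is no variation across sentences.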

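Dependencies

The dependencies differ depending on which measures you want to calculate.

Basic and readability: numpy, pandas, pyphen, pycountry
Etymology: nltk and the following models: python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')"
Dependency distance: snlp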
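
Because the requirements vary by category, it can help to check what is importable before calling a given function. The sketch below does this with the standard library only. The grouping mirrors the list above; treating stanfordnlp as the importable module behind "snlp" is an assumption, and missing_modules is a hypothetical helper rather than part of textdescriptives.

```python
from importlib.util import find_spec

# Module names follow the dependency list above. Using "stanfordnlp" as the
# importable name for "snlp" is an assumption of this sketch.
REQUIREMENTS = {
    "basic and readability": ["numpy", "pandas", "pyphen", "pycountry"],
    "etymology": ["nltk"],
    "dependency distance": ["stanfordnlp"],
}


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    for category, names in REQUIREMENTS.items():
        missing = missing_modules(names)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{category}: {status}")
```

The check only confirms that the modules can be located; it does not verify that the nltk or stanfordnlp models have been downloaded.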

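Metrics

Currently implemented metrics:

Basic descriptive statistics - mean, median, and standard deviation of the following:

Word length
Sentence length, words
Sentence length, characters (TODO)
Syllables per word
Number of characters
Number of sentences
Number of types (unique words)
Number of tokens (total words)
Type/token ratio
Lix
Rix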
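
As an illustration of the last three items, the sketch below uses the standard definitions: LIX is words per sentence plus the percentage of long words (more than six letters), RIX is long words per sentence, and the type/token ratio is unique words over total words. The regex-based tokenizer and the lower-casing of types are simplifying assumptions; this is not the package's implementation.

```python
import re


def lix_rix_ttr(text):
    """Compute LIX, RIX and the type/token ratio with a naive tokenizer."""
    # Naive splitting: sentences end at ., ! or ?; words are runs of \w characters
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    long_words = [w for w in words if len(w) > 6]
    return {
        "lix": len(words) / len(sentences) + 100 * len(long_words) / len(words),
        "rix": len(long_words) / len(sentences),
        # Types are counted case-insensitively here (an assumption)
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }


if __name__ == "__main__":
    text = ("He felt that his whole life was some kind of dream and he "
            "sometimes wondered whose it was and whether they were enjoying it.")
    print(lix_rix_ttr(text))
```

Applied to the second example text from the Usage section, this gives LIX of about 40.67, RIX of 4 and a type/token ratio of about 0.833, which agrees with the all_metrics output shown there.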

Readability metrics:

Gunning-Fog
SMOG
Flesch reading ease
Flesch-Kincaid grade
Automated readability index
Coleman-Liau index
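
Two of these scores can be worked out directly from the basic statistics. The sketch below applies the standard Flesch reading ease and Flesch-Kincaid grade formulas to an average sentence length (in words) and an average number of syllables per word. The function names are chosen for this example; the package itself relies on code derived from textstat.

```python
def flesch_reading_ease(avg_sentence_length, avg_syl_per_word):
    """Standard Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syl_per_word


def flesch_kincaid_grade(avg_sentence_length, avg_syl_per_word):
    """Standard Flesch-Kincaid grade level (approximate US school grade)."""
    return 0.39 * avg_sentence_length + 11.8 * avg_syl_per_word - 15.59


if __name__ == "__main__":
    # Second example text: 24 words in 1 sentence, 28 syllables in total
    avg_sentence_length = 24 / 1
    avg_syl_per_word = 28 / 24
    print(round(flesch_reading_ease(avg_sentence_length, avg_syl_per_word), 3))
    print(round(flesch_kincaid_grade(avg_sentence_length, avg_syl_per_word), 3))
```

With these inputs the formulas give about 83.775 and 7.537, the flesch_reading_ease and flesch_kincaid_grade values reported for that text in the Usage section.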

Etymology-related metrics:

Percentage of words of Germanic origin
Percentage of words of Latinate origin
Latinate/Germanic origin ratio
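
The three etymology values are related by simple arithmetic, sketched below for a list of per-word origin labels. How macroetym assigns an origin to each word, and which words it leaves out of the denominator, is not covered here; the labels and the function name are assumptions made for this example.

```python
from collections import Counter


def etymology_percentages(origins):
    """Summarize per-word origin labels such as "Germanic" or "Latinate"."""
    if not origins:
        raise ValueError("at least one origin label is required")
    counts = Counter(origins)
    total = len(origins)
    germanic = 100 * counts["Germanic"] / total
    latinate = 100 * counts["Latinate"] / total
    # The ratio is undefined when there are no Germanic words
    ratio = latinate / germanic if germanic else float("nan")
    return {
        "Germanic": germanic,
        "Latinate": latinate,
        "Latinate/Germanic": ratio,
    }


if __name__ == "__main__":
    # Hypothetical labels: three Germanic words and one Latinate word
    print(etymology_percentages(["Germanic", "Germanic", "Germanic", "Latinate"]))
```

A 75/25 split gives a ratio of one third, the same relationship seen between the Germanic, Latinate and Latinate/Germanic columns in the example output.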

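Dependency distance metrics:

Mean dependency distance, sentence level (mean, standard deviation)
Mean proportion of adjacent dependency relations, sentence level (mean, standard deviation)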

By Hansen at ...


textdescriptives 0.1.1
