A simple program for calculating lexical diversity
Detailed description of the lexical-diversity Python project
Install with pip:
pip install lexical-diversity
Getting started:
>>> from lexical_diversity import lex_div as ld
Preprocessing text:
For convenience, users can tokenize a text with the tokenize() function or with a predefined tokenize function (e.g., from nltk):
>>> text = """The state was named for the Colorado River, which Spanish travelers named the Río Colorado for the ruddy silt the river carried from the mountains. The Territory of Colorado was organized on February 28, 1861, and on August 1, 1876, U.S. President Ulysses S. Grant signed Proclamation 230 admitting Colorado to the Union as the 38th state. Colorado is nicknamed the "Centennial State" because it became a state a century after the signing of the United States Declaration of Independence. Colorado is bordered by Wyoming to the north, Nebraska to the northeast, Kansas to the east, Oklahoma to the southeast, New Mexico to the south, Utah to the west, and touches Arizona to the southwest at the Four Corners. Colorado is noted for its vivid landscape of mountains, forests, high plains, mesas, canyons, plateaus, rivers, and desert lands. Colorado is part of the western or southwestern United States, and one of the Mountain States. Denver is the capital and most populous city of Colorado. Residents of the state are known as Coloradans, although the antiquated term "Coloradoan" is occasionally used."""
>>> tok = ld.tokenize(text)
>>> print(tok[:10])
['the', 'state', 'was', 'named', 'for', 'the', 'colorado', 'river', 'which', 'spanish']
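To make the preprocessing step concrete, here is a minimal sketch of a tokenizer that lowercases a text and keeps only runs of word characters. The name simple_tokenize is hypothetical, and this is a stand-in for illustration, not the package's tokenize() implementation; edge cases such as abbreviations ("U.S.") may be handled differently.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return its runs of word characters.

    Hypothetical stand-in for illustration; not the package's tokenizer.
    """
    return re.findall(r"\w+", text.lower())

sample = ("The state was named for the Colorado River, which Spanish "
          "travelers named the Río Colorado.")
tokens = simple_tokenize(sample)
print(tokens[:10])
```

Any function that returns a list of strings can play this role, since the diversity measures only need a token list.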
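For convenience, you can also lemmatize a text using the simple flemmatize() function, which is not part-of-speech specific ("run" as a noun and "run" as a verb are treated as the same word). However, it is better to use a part-of-speech-sensitive lemmatizer (e.g., using spaCy).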
>>> flt = ld.flemmatize(text)
>>> print(flt[:10])
['the', 'state', 'be', 'name', 'for', 'the', 'colorado', 'river', 'which', 'spanish']
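One way a lemmatizer that ignores part of speech can work is a plain lookup from word form to lemma. The sketch below assumes a tiny, hypothetical table (LEMMA_TABLE) and a hypothetical helper (lookup_lemmatize); it illustrates the idea only and does not describe the internals of flemmatize().

```python
# Hypothetical toy table; a real lemma list would hold many thousands of entries.
LEMMA_TABLE = {
    "was": "be", "is": "be", "named": "name",
    "runs": "run", "running": "run", "ran": "run",
}

def lookup_lemmatize(tokens, table=LEMMA_TABLE):
    """Replace each token with its table entry; unknown tokens pass through unchanged."""
    return [table.get(tok, tok) for tok in tokens]

tokens = ["the", "state", "was", "named", "for", "the", "colorado", "river"]
print(lookup_lemmatize(tokens))

# "runs" maps to "run" whether it was used as a noun or as a verb.
print(lookup_lemmatize(["two", "runs", "she", "runs"]))
```

Because the lookup never sees the context, both uses of "runs" collapse to one lemma, which is the limitation a part-of-speech-sensitive lemmatizer avoids.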
Calculating lexical diversity
Simple TTR
>>> ld.ttr(flt)
0.5777777777777777
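The simple type-token ratio is the number of distinct tokens (types) divided by the total number of tokens. A minimal sketch, using a hypothetical helper named type_token_ratio rather than the package's own function:

```python
def type_token_ratio(tokens):
    """Number of unique tokens (types) divided by the total number of tokens."""
    if not tokens:
        return 0.0  # sketch choice: treat an empty text as zero diversity
    return len(set(tokens)) / len(tokens)

tokens = ["the", "state", "be", "name", "for", "the", "colorado", "river"]
print(type_token_ratio(tokens))  # 7 types / 8 tokens = 0.875
```

Because longer texts repeat more words, this ratio tends to fall as text length grows, which is what the remaining measures try to correct for.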
Root TTR
>>> ld.root_ttr(flt)
7.751702321999271
Log TTR
>>> ld.log_ttr(flt)
0.8943634681549878
Maas TTR
>>> ld.maas_ttr(flt)
0.04683980831849556
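The root, log, and Maas values above are all simple transformations of the type count V and the token count N. The sketch below uses the standard formulas (V/√N, log V / log N, and (log N − log V)/(log N)²); the function names mirror the package's for readability, but the bodies are an independent illustration. The reported outputs appear consistent with N = 180 and V = 104, and the Maas value matches when base-10 logarithms are used, which is the assumption made here.

```python
import math

def root_ttr(tokens):
    """Types divided by the square root of the token count (Guiraud's index)."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def log_ttr(tokens):
    """Log of types divided by log of tokens (Herdan's C); needs at least two tokens."""
    return math.log(len(set(tokens))) / math.log(len(tokens))

def maas_ttr(tokens):
    """(log N - log V) / (log N)^2 with base-10 logs; needs at least two tokens."""
    n, v = len(tokens), len(set(tokens))
    return (math.log10(n) - math.log10(v)) / math.log10(n) ** 2

# Synthetic stand-in for the lemmatized text: 180 tokens, 104 distinct types.
tokens = [f"w{i}" for i in range(104)] + ["w0"] * 76
print(root_ttr(tokens), log_ttr(tokens), maas_ttr(tokens))
```

Unlike the other indices, a lower Maas value indicates a more diverse text.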
Mean segmental TTR (MSTTR)
By default, the segment size is 50 words. This can be customized with the window_length argument.
>>> ld.msttr(flt)
0.7133333333333333
>>> ld.msttr(flt,window_length=25)
0.7885714285714285
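MSTTR can be sketched as the mean TTR of consecutive, non-overlapping segments. The version below drops a trailing segment that is shorter than window_length, which is the usual definition and is assumed here; the fallback for very short texts is a sketch choice, not confirmed package behavior.

```python
def msttr(tokens, window_length=50):
    """Mean TTR over consecutive non-overlapping segments of window_length tokens.

    A trailing partial segment is ignored (assumption of this sketch).
    """
    scores = []
    for start in range(0, len(tokens) - window_length + 1, window_length):
        segment = tokens[start:start + window_length]
        scores.append(len(set(segment)) / window_length)
    if not scores:  # text shorter than one segment: fall back to plain TTR
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    return sum(scores) / len(scores)

# Two full segments of four tokens (TTR 1.0 and 0.25); the last two tokens are dropped.
tokens = list("abcd") + list("aaaa") + list("xy")
print(msttr(tokens, window_length=4))  # (1.0 + 0.25) / 2 = 0.625
```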
Moving average TTR (MATTR)
By default, the window size is 50 words. This can be customized with the window_length argument.
>>> ld.mattr(flt)
0.7206106870229007
>>> ld.mattr(flt,window_length=25)
0.7961538461538458
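MATTR differs from MSTTR in that the window slides one token at a time, so every token position (up to the last full window) contributes. A minimal sketch under that definition; the short-text fallback is an assumption of the sketch:

```python
def mattr(tokens, window_length=50):
    """Mean TTR over every window of window_length consecutive tokens."""
    if len(tokens) < window_length:  # too short for one window: plain TTR
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    n_windows = len(tokens) - window_length + 1
    total = sum(
        len(set(tokens[i:i + window_length])) / window_length
        for i in range(n_windows)
    )
    return total / n_windows

# Windows of size 2: "aa" (0.5), "ab" (1.0), "bb" (0.5).
print(mattr(["a", "a", "b", "b"], window_length=2))
```

This direct version recomputes each window from scratch, which is fine for short texts; an incremental type counter would be faster on long ones.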
Hypergeometric distribution D (HDD)
A more straightforward and reliable implementation of vocD (Malvern, Richards, Chipere, & Duran, 2004), as per McCarthy and Jarvis (2007, 2010).
>>> ld.hdd(flt)
0.7346993253061275
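HD-D can be read as the expected TTR of a random sample drawn from the text: for each type, the hypergeometric probability that it occurs at least once in the sample, divided by the sample size, summed over all types. The sketch below assumes the 42-token sample size described by McCarthy and Jarvis; the sample_size parameter and the return value for short texts are choices of this sketch.

```python
from collections import Counter
from math import comb

def hdd(tokens, sample_size=42):
    """Expected TTR of a random sample of sample_size tokens drawn without replacement."""
    n_tokens = len(tokens)
    if n_tokens < sample_size:
        return 0.0  # sketch choice: not defined for texts shorter than the sample
    all_samples = comb(n_tokens, sample_size)
    score = 0.0
    for freq in Counter(tokens).values():
        # Probability that none of this type's occurrences land in the sample.
        p_absent = comb(n_tokens - freq, sample_size) / all_samples
        score += (1 - p_absent) / sample_size
    return score

# Fifty distinct tokens: every sampled token is a new type, so the expected TTR is 1.
print(hdd([f"w{i}" for i in range(50)]))
```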
Measure of textual lexical diversity (MTLD)
Calculates MTLD based on McCarthy and Jarvis (2010).
>>> ld.mtld(flt)
36.50595044690307
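The MTLD procedure can be sketched as follows: read the text token by token, close a "factor" each time the running TTR falls below 0.72, add a partial factor for any leftover tokens, divide the text length by the factor count, and average the forward and backward passes. This is an illustration of the published procedure, not the package's code; the package may apply additional rules (for example a minimum factor length), so outputs need not match. The helper name _mtld_one_direction is hypothetical.

```python
def _mtld_one_direction(tokens, threshold=0.72):
    """Text length divided by the number of factors found in one pass."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:  # factor complete (strict "<" is an assumption)
            factors += 1
            types, count = set(), 0
    if count > 0:  # leftover tokens contribute a partial factor
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else 0.0

def mtld(tokens, threshold=0.72):
    """Mean of the forward and backward one-directional scores."""
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2

# Ten identical tokens: the TTR drops to 0.5 every second token, giving five factors.
print(mtld(["a"] * 10))  # 10 tokens / 5 factors = 2.0
```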
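Measure of textual lexical diversity (moving average, wrap)
Calculates MTLD using a moving window approach. Instead of calculating partial factors, it wraps to the beginning of the text to complete the last factors.
>>> ld.mtld_ma_wrap(flt)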
33.68333333333333
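A rough sketch of the wrap-around idea: start a factor at every token position, extend it until the running TTR falls below 0.72, and let factors that reach the end of the text continue from the beginning. The score is the mean factor length. Limiting each factor to one full pass over the text and omitting any minimum factor length are assumptions of this sketch, so its output may differ from the package's.

```python
def mtld_ma_wrap(tokens, threshold=0.72):
    """Mean length of the factors started at each position, wrapping around the text."""
    n = len(tokens)
    doubled = tokens + tokens  # lets a factor that starts near the end continue from the start
    lengths = []
    for start in range(n):
        types = set()
        for offset in range(n):  # sketch choice: a factor spans at most one full pass
            types.add(doubled[start + offset])
            if len(types) / (offset + 1) < threshold:
                lengths.append(offset + 1)
                break
    return sum(lengths) / len(lengths) if lengths else 0.0

# Every start position completes a factor after two tokens, including the last one.
print(mtld_ma_wrap(["a"] * 6))  # 2.0
```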
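Measure of textual lexical diversity (moving average, bi-directional)
Calculates the average MTLD score by calculating MTLD in each direction using a moving window approach.
>>> ld.mtld_ma_bid(flt)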
35.46626265150569
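The bi-directional variant can be sketched the same way, but without wrapping: a factor is started at every position, factors that run off the end of the text before the TTR drops below 0.72 are ignored, and the forward and backward means are averaged. The helper name _mtld_ma is hypothetical, and details such as a minimum factor length are left out, so this is an approximation of the idea rather than the package's implementation.

```python
def _mtld_ma(tokens, threshold=0.72):
    """Mean length of the factors that complete before the text ends."""
    lengths = []
    for start in range(len(tokens)):
        types = set()
        for length, tok in enumerate(tokens[start:], start=1):
            types.add(tok)
            if len(types) / length < threshold:
                lengths.append(length)
                break
    return sum(lengths) / len(lengths) if lengths else 0.0

def mtld_ma_bid(tokens, threshold=0.72):
    """Average of the moving-average score read forward and read backward."""
    return (_mtld_ma(tokens, threshold) + _mtld_ma(tokens[::-1], threshold)) / 2

# Forward: one factor of length 2. Backward: factors of length 3 and 2 (mean 2.5).
print(mtld_ma_bid(["a", "a", "b", "c"]))  # (2.0 + 2.5) / 2 = 2.25
```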