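A linguistic tool for extracting and searching linguistic features
Detailed description of the Python project LFExtractor
Linguistic Feature Extractor
Description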
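- A corpus linguistics tool for extracting and searching for linguistic features in a text or a corpus.
- The main version has 95 built-in linguistic features, while the thesis project version has 98. The removed features are the number of words per sentence, the number of utterances, and the number of overlaps, which are not considered to be accessible in a normal corpus.
- More than two thirds of these features come from Biber et al. (2006), and 42 of them also appear in Biber (1988). These features are commonly known as part of the multi-dimensional (MD) analysis framework.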
- The program was mainly tested on two corpora that are accessible online, the British Academic Spoken Corpus and the Michigan Corpus of Academic Spoken English, but for copyright reasons it is tested here on test_sample.
Prerequisites
Computer Languages
- Python 3.6+: check with the command `python --version` (Download Page).
- Java 1.8+: check with the command `java -version` (Download Page).
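The two checks above can also be scripted. The following is a minimal sketch, not part of LFExtractor: the helper names `python_ok` and `java_on_path` are hypothetical, and the Java check only confirms that a `java` executable is on the PATH, not that its version is 1.8 or later.

```python
import shutil
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in the prerequisites


def python_ok(version_info=sys.version_info):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def java_on_path():
    """Return the path of the `java` executable, or None if it is not on PATH."""
    return shutil.which("java")


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("java executable:", java_on_path() or "not found")
```

If `java_on_path()` returns a path, running `java -version` by hand is still needed to confirm the version.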
Python packages
Package | Description | Pip download |
---|---|---|
stanfordcorenlp | A Python wrapper for StanfordCoreNLP | pip install stanfordcorenlp |
pandas | Used for storing extracted feature frequencies | pip install pandas |
In addition, the program makes heavy use of built-in packages, especially the built-in `re` package for regular expressions.
Installation
- Download directly from this page and cd into the project folder.
- Via pip:
pip/pip3 install LFExtractor
Usage
Path to StanfordCoreNLP
The first time you use the program, please specify the directory of StanfordCoreNLP in text_processor.py under the LFE folder.
nlp = StanfordCoreNLP("/path/to/StanfordCoreNLP/")
Example: nlp = StanfordCoreNLP("/Users/wzx/p_包/stanford-corenlp-4.1.0")
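A wrong path here is an easy mistake to make, so it can help to sanity-check the directory before passing it to `StanfordCoreNLP`. The sketch below is an assumption-laden heuristic, not part of LFExtractor or of the stanfordcorenlp package: the helper `looks_like_corenlp_dir` is hypothetical and only checks that the directory exists and contains at least one `.jar` file.

```python
from pathlib import Path


def looks_like_corenlp_dir(path):
    """Heuristic check: the directory exists and holds at least one .jar file."""
    directory = Path(path)
    return directory.is_dir() and any(directory.glob("*.jar"))


if __name__ == "__main__":
    # Placeholder path from the text above; replace it with your own directory.
    print(looks_like_corenlp_dir("/path/to/StanfordCoreNLP/"))
```

A `True` result does not guarantee a complete CoreNLP installation; it only rules out the most common case of pointing at a missing or empty folder.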
Processing a set of files
from LFE.extractor import CorpusLFE
lfe = CorpusLFE('/directory/to/the/corpus/under/analysis/')
# get frequency data, the tagged corpus and the extracted features by default
lfe.corpus_feature_fre_extraction()
# defaults: lfe.corpus_feature_fre_extraction(normalized_rate=100, save_tagged_corpus=True, save_extracted_features=True, left=0, right=0)
# change the normalized_rate, turn off the tagged text and save the extracted text with the specified context
lfe.corpus_feature_fre_extraction(1000, False, True, 2, 3)
# here the frequency data are normalized at 1000 words
# get frequency data only
lfe.corpus_feature_fre_extraction(save_tagged_corpus=False, save_extracted_features=False)
# get the tagged corpus only
lfe.save_tagged_corpus()
# get the extracted features only
lfe.save_corpus_extracted_features()
# defaults: lfe.save_corpus_extracted_features(left=0, right=0)
# set how many words to display around the target pattern
lfe.save_corpus_extracted_features(2, 3)
# extract and save a specific linguistic feature by feature name
# to see the built-in features' names, use `show_feature_names()`
from LFE.extractor import *
print(show_feature_names())
# Six letter words and longer, Contraction, Agentless passive, By passive...
# specify which feature to extract and save
lfe.save_corpus_one_extracted_feature_by_name('Six letter words and longer')
# extract and save a specific linguistic feature by regex, for example, 'you know'
lfe.save_corpus_one_extracted_feature_by_regex(r'you_\S+ know_\S+', 2, 2, feature_name='You Know')
# extracts the phrase 'you know' along with 2 words of context on each side
# remember the '_\S+' at the end of each word, since the corpus is automatically POS tagged
# for more complex structures, features_set.py can be utilized, for example, to extract an "article + adj + noun" structure
from LFE import features_set as fs
ART = fs.ART
ADJ = fs.ADJ
NOUN = fs.NOUN
lfe.save_corpus_one_extracted_feature_by_regex(rf'{ART}{ADJ}{NOUN}', 2, 2, 'Noun phrase')
# result example (use test_sample): away_RB by_IN 【 the_DT whole_JJ thing_NN 】 In_IN fact_NN
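To see why each word in a pattern needs the trailing `_\S+`, it may help to run such a regex on a small piece of tagged text with the standard `re` module. The sketch below is an illustration only and does not reproduce LFExtractor's internals: the function `find_with_context` and the sample sentence are hypothetical, and the `word_TAG` format with 【 】 markers is assumed from the result example above.

```python
import re


def find_with_context(tagged_text, pattern, left=0, right=0):
    """Return each match of `pattern` with `left`/`right` tokens of context.

    Each hit is rendered as: left tokens 【 match 】 right tokens.
    """
    hits = []
    for m in re.finditer(pattern, tagged_text):
        before = tagged_text[:m.start()].split()
        after = tagged_text[m.end():].split()
        left_ctx = before[-left:] if left else []
        right_ctx = after[:right]
        hits.append(" ".join(left_ctx + ["【", m.group(0), "】"] + right_ctx))
    return hits


# Hypothetical POS-tagged sentence in word_TAG format.
tagged = "Well_UH you_PRP know_VBP it_PRP was_VBD a_DT long_JJ day_NN"
print(find_with_context(tagged, r"you_\S+ know_\S+", left=1, right=2))
# ['Well_UH 【 you_PRP know_VBP 】 it_PRP was_VBD']
```

Without the `_\S+` suffix, the plain pattern `you know` would not match, because every token in the tagged text carries its tag. A pattern such as `you_\S+` can also match inside a longer token, so prefixing it with `\b` is worth considering.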
Processing a text
Processing part of a corpus
from LFE.extractor import *
lfe = CorpusLFE('/directory/to/the/corpus/under/analysis/')
# get the file path list, select the files you want to examine and construct a list
fp_list = lfe.get_filepath_list()
# loop through the list and use the functionalities mentioned above to get the results you want
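The selection step can be sketched with plain Python. This example assumes that `get_filepath_list()` returns a list of path strings, which the text above does not confirm. The helper `select_files` and the sample paths are hypothetical, with a hard-coded list standing in for the real return value.

```python
from pathlib import Path


def select_files(filepaths, name_contains):
    """Keep only the paths whose file name contains `name_contains`."""
    return [fp for fp in filepaths if name_contains in Path(fp).name]


# Stand-in for lfe.get_filepath_list(); the real list comes from your corpus.
fp_list = [
    "corpus/lecture_01.txt",
    "corpus/seminar_01.txt",
    "corpus/lecture_02.txt",
]

for fp in select_files(fp_list, "lecture"):
    print(fp)
```

Filtering on the file name rather than the full path avoids accidental matches on directory names.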