A library for augmenting text for natural language processing applications.
TextAugment: Improving short text classification through global augmentation methods
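TextAugment is a Python 3 library for augmenting text for natural language processing applications. It stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.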
Citation
Improving short text classification through global augmentation methods, published at MLDM 2019
Requirements
- Python 3
The following packages are dependencies and will be installed automatically.
$ pip install numpy nltk gensim textblob googletrans
The following code downloads the NLTK corpus for WordNet.
nltk.download('wordnet')
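The following code downloads the NLTK tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.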
nltk.download('punkt')
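The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.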
nltk.download('averaged_perceptron_tagger')
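The three downloads above can be grouped into one helper so that a setup script fetches everything in a single call. This is a minimal sketch: the `ensure_nltk_resources` name and its injectable `download` parameter are assumptions made for illustration and are not part of TextAugment or NLTK.

```python
# Resource names taken from the download commands above.
NLTK_RESOURCES = ("wordnet", "punkt", "averaged_perceptron_tagger")


def ensure_nltk_resources(resources=NLTK_RESOURCES, download=None):
    """Download each NLTK resource and return the names that failed.

    `download` is a hypothetical hook for testing; it defaults to nltk.download.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be exercised without NLTK
        download = nltk.download
    failed = []
    for name in resources:
        # nltk.download returns True on success and False otherwise.
        if not download(name):
            failed.append(name)
    return failed
```

Calling `ensure_nltk_resources()` with no arguments needs NLTK installed and internet access; an empty list means every resource was fetched.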
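Use gensim to load a pre-trained word2vec model, such as the Google News vectors from Google Drive.
import gensim
# KeyedVectors.load_word2vec_format replaces the deprecated Word2Vec.load_word2vec_format
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)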
Or train one from scratch using your own data or a public dataset.
Installation
Install from pip [Recommended]
$ pip install textaugment
or install the latest release
$ pip install git+git@github.com:dsfsi/textaugment.git
Install from source
$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install
How to use
There are three types of augmentation that can be used:
- word2vec
from textaugment import Word2vec
- WordNet
from textaugment import Wordnet
- translate (this requires internet access)
from textaugment import Translate
Word2vec-based augmentation
Basic example
>>> from textaugment import Word2vec
>>> t = Word2vec(model='path/to/gensim/model' or 'gensim model itself')
>>> t.augment('The stories are good')
The films are good
Advanced example
>>> runs = 1 # By default.
>>> v = False # verbose mode to replace all the words. If enabled, runs is not effective. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5 # The probability of success of an individual trial. (0.1<p<1.0), default is 0.5. Used by the geometric distribution to select words from a sentence.
>>> t = Word2vec(model='path/to/gensim/model' or 'gensim model itself', runs=5, v=False, p=0.5)
>>> t.augment('The stories are good')
The movies are excellent
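To make the role of `p` more concrete, the sketch below shows one way a geometric distribution could decide how many words to replace. It is an illustration under stated assumptions, not TextAugment's implementation: `geometric_sample`, `replace_words`, and the `toy_neighbours` table (a stand-in for word2vec nearest-neighbour lookups) are all hypothetical.

```python
import random


def geometric_sample(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:  # failure with probability 1 - p
        trials += 1
    return trials


def replace_words(sentence, neighbours, p=0.5, rng=None):
    """Replace a geometrically distributed number of words with a neighbour.

    `neighbours` is a toy stand-in for word2vec nearest-neighbour lookups.
    """
    rng = rng or random.Random()
    words = sentence.split()
    # Only words with a known neighbour can be replaced.
    candidates = [i for i, w in enumerate(words) if w.lower() in neighbours]
    how_many = min(geometric_sample(p, rng), len(candidates))
    for i in rng.sample(candidates, how_many):
        words[i] = rng.choice(neighbours[words[i].lower()])
    return " ".join(words)


toy_neighbours = {"stories": ["films", "movies"], "good": ["excellent", "great"]}
print(replace_words("The stories are good", toy_neighbours, p=0.5))
```

In this sketch a larger `p` makes the first success arrive sooner, so fewer words tend to be replaced per call; a smaller `p` tends to replace more.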
WordNet-based augmentation
Basic example
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town
Advanced example
>>> v = True # enable verbs augmentation. By default is True.
>>> n = False # enable nouns augmentation. By default is False.
>>> runs = 1 # number of times to augment a sentence. By default is 1.
>>> p = 0.5 # The probability of success of an individual trial. (0.1<p<1.0), default is 0.5. Used by the geometric distribution to select words from a sentence.
>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, Joseph is going to town.
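The effect of the `v` and `n` flags can be pictured as a part-of-speech filter applied before synonym lookup. The sketch below is an assumption about the general idea rather than the library's code: `wordnet_style_augment` is a hypothetical function, the tags are written by hand in the Penn Treebank style that NLTK's default tagger produces, and `toy_synonyms` stands in for a WordNet lookup.

```python
def wordnet_style_augment(tagged_words, synonyms, v=True, n=False):
    """Swap verbs and/or nouns for the first listed synonym.

    `tagged_words` is a list of (word, tag) pairs with Penn Treebank tags.
    `synonyms` is a toy stand-in for a WordNet lookup.
    """
    out = []
    for word, tag in tagged_words:
        is_verb = tag.startswith("VB")
        is_noun = tag.startswith("NN")
        eligible = (v and is_verb) or (n and is_noun)
        if eligible and synonyms.get(word):
            out.append(synonyms[word][0])
        else:
            out.append(word)  # left unchanged
    return " ".join(out)


tagged = [("John", "NNP"), ("is", "VBZ"), ("going", "VBG"), ("to", "TO"), ("town", "NN")]
toy_synonyms = {"going": ["walking"], "John": ["Joseph"]}

print(wordnet_style_augment(tagged, toy_synonyms, v=True, n=False))
# John is walking to town
print(wordnet_style_augment(tagged, toy_synonyms, v=False, n=True))
# Joseph is going to town
```

With verbs enabled only the verb changes, and with nouns enabled only the noun changes, which mirrors the two example outputs in this section.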
RTT-based augmentation
Example
>>> src = "en" # source language of the sentence
>>> to = "fr" # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
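Round-trip translation (RTT) sends a sentence to the target language and back, and keeps the paraphrase that returns. The sketch below shows that flow offline. It assumes a hypothetical `translate(text, src, dest)` callable in place of an online translation service, and `_TOY_TABLE` is an invented phrase table; neither is part of TextAugment.

```python
def round_trip(sentence, translate, src="en", to="fr"):
    """Translate src -> to and back again; the result is the augmented text.

    `translate(text, src, dest)` is a hypothetical callable standing in for
    an online translation service.
    """
    pivot = translate(sentence, src, to)
    return translate(pivot, to, src)


# Toy phrase table so the sketch runs offline; a real service would go here.
_TOY_TABLE = {
    ("en", "fr"): {"John is going to town": "John va en ville"},
    ("fr", "en"): {"John va en ville": "John goes to town"},
}


def toy_translate(text, src, dest):
    # Fall back to the input when the phrase is unknown.
    return _TOY_TABLE[(src, dest)].get(text, text)


print(round_trip("John is going to town", toy_translate))
# John goes to town
```

The augmentation comes from the two translation steps not being exact inverses, so the returned sentence keeps the meaning while the wording shifts.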
Authors
Acknowledgements
Cite this paper when using this library.
License
MIT licensed. See the bundled LICENCE file for more details.