absum Python package - PyPI

Abstractive Summarization for Data Augmentation

Python 3.6, 3.7

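Introduction

Imbalanced class distribution is a common problem in ML, and undersampling and oversampling are two ways of addressing it. Techniques such as SMOTE are effective for oversampling, but the problem becomes harder with multi-label datasets. MLSMOTE has been proposed for that case, but the high dimensionality of the numerical vectors created from text can make other forms of data augmentation more appealing.

absum is an NLP library that uses abstractive summarization to perform data augmentation, oversampling the under-represented classes in a dataset. Recent developments in abstractive summarization make this approach well suited to producing realistic data for the augmentation process.

By default it uses the Huggingface T5 model, but it is designed in a modular way so that you can use any pre-trained or out-of-the-box Transformers model capable of abstractive summarization. absum is format agnostic: it expects only a dataframe containing the text and all features. It also uses multiprocessing to improve performance.

Singular summarization calls are also possible.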

Algorithm

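1. Append counts, the number of rows to add for each feature, are first calculated against a ceiling threshold. For example, if a given feature has 1000 rows and the ceiling is 100, its append count is 0.

2. For each feature, a loop then runs over the append index range up to the append count specified for that feature. The append index is stored to allow for multiprocessing.

3. An abstractive summarization is computed for a subset, of a specified size, of all rows that uniquely have the given feature. If multiprocessing is set, the call to abstractive summarization is stored in a task array, which is later passed to a subroutine that runs the calls in parallel using the multiprocessing library, greatly reducing runtime.

4. Each summarization is appended to a new dataframe with the respective features one-hot encoded.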
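The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not absum's actual implementation: the helper names (append_counts, augment, first_words) are hypothetical, rows are plain dicts rather than a dataframe, "uniquely have the given feature" is read as "carry that feature and no other", and a trivial stand-in replaces the T5 summarizer.

```python
import random


def append_counts(rows, features, threshold):
    """Rows to add per feature: the shortfall below the ceiling, never negative."""
    counts = {}
    for feature in features:
        current = sum(1 for row in rows if row[feature] == 1)
        counts[feature] = max(threshold - current, 0)
    return counts


def augment(rows, features, threshold, num_samples, summarize, seed=0):
    """Return new rows, one per generated summary, with one-hot feature columns."""
    rng = random.Random(seed)
    new_rows = []
    for feature, count in append_counts(rows, features, threshold).items():
        # Assumption: only rows that carry this feature and no other one are sampled.
        pool = [r for r in rows
                if r[feature] == 1 and sum(r[f] for f in features) == 1]
        if not pool:
            continue
        for _ in range(count):
            sample = rng.sample(pool, min(num_samples, len(pool)))
            summary = summarize(" ".join(r["text"] for r in sample))
            new_row = {"text": summary}
            new_row.update({f: int(f == feature) for f in features})
            new_rows.append(new_row)
    return new_rows


def first_words(text, n=5):
    """Stand-in summarizer (hypothetical): keeps the first n words."""
    return " ".join(text.split()[:n])


rows = [
    {"text": "great food and friendly staff", "food": 1, "service": 0},
    {"text": "tasty dishes at a fair price", "food": 1, "service": 0},
    {"text": "fresh ingredients every time", "food": 1, "service": 0},
    {"text": "the waiter was slow", "food": 0, "service": 1},
]
features = ["food", "service"]

counts = append_counts(rows, features, threshold=3)
augmented = augment(rows, features, threshold=3, num_samples=2,
                    summarize=first_words)
```

Here "food" already meets the ceiling of 3, so only "service" receives new rows. In a real run the stand-in would be replaced by a call to an abstractive summarization model.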

Installation

Via pip

pip install absum

From source

git clone https://github.com/aaronbriel/absum.git
cd absum
pip install .

or

pip install git+https://github.com/aaronbriel/absum.git

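Usage

absum expects a dataframe containing a text column, which defaults to "text", with the remaining columns representing one-hot encoded features. If the dataframe contains additional columns that you do not want considered, you can pass the specific one-hot encoded features as a comma-separated string to the "features" parameter. All available parameters are detailed in the Parameters section below.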

import pandas as pd
from absum import Augmentor

csv = 'path_to_csv'
df = pd.read_csv(csv)
augmentor = Augmentor(df, text_column='review_text')
df_augmented = augmentor.abs_sum_augment()
# Store resulting dataframe as a csv
df_augmented.to_csv(csv.replace('.csv', '-augmented.csv'), encoding='utf-8', index=False)
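If your labels are not yet one-hot encoded, the expected input layout (one text column plus one 0/1 column per feature) could be prepared roughly as follows. This is a sketch with made-up data: the "labels" column and the example reviews are hypothetical, and only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical raw data: one free-text column plus a list of labels per row.
raw = pd.DataFrame({
    "review_text": ["great food", "slow service", "good food, rude staff"],
    "labels": [["food"], ["service"], ["food", "service"]],
})

# One column per label: 1 where the row carries it, 0 otherwise.
one_hot = raw["labels"].str.join("|").str.get_dummies(sep="|")
df = pd.concat([raw[["review_text"]], one_hot], axis=1)

# Comma-separated string in the shape described for the `features` parameter.
features = ",".join(one_hot.columns)
```

A frame like this could then be passed as Augmentor(df, text_column='review_text', features=features).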

Running a singular summarization on any chunk of text is simple:

text = chunk_of_text_to_summarize
augmentor = Augmentor(min_length=100, max_length=200)
output = augmentor.get_abstractive_summarization(text)

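NOTE: When running any summarization you may see the following warning message, which can be ignored: "Token indices sequence length is longer than the specified maximum sequence length for this model (2987 > 512). Running this sequence through the model will result in indexing errors". For more information, refer to this issue.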

Parameters

Name	Type	Description
df	(pandas.DataFrame, optional, defaults to None)	Dataframe containing text and one-hot encoded features.
text_column	(str, optional, defaults to "text")	Column in df containing text.
features	(str, optional, defaults to None)	Comma-separated string of features to possibly augment data for.
device	(torch.device, optional, 'cuda' or 'cpu')	Torch device to run on: cuda if available, otherwise cpu.
model	(optional, defaults to T5ForConditionalGeneration.from_pretrained('t5-small'))	Model used for abstractive summarization.
tokenizer	(optional, defaults to T5Tokenizer.from_pretrained('t5-small'))	Tokenizer used for abstractive summarization.
return_tensors	(str, optional, defaults to "pt")	Can be set to 'tf', 'pt' or 'np' to return a TensorFlow tf.constant, PyTorch torch.Tensor or NumPy np.ndarray respectively, instead of a list of Python integers.
num_beams	(int, optional, defaults to 4)	Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search.
no_repeat_ngram_size	(int, optional, defaults to 4)	If set to int > 0, all ngrams of size no_repeat_ngram_size can only occur once.
min_length	(int, optional, defaults to 10)	The minimum length of the sequence to be generated. Between 0 and infinity.
max_length	(int, optional, defaults to 50)	The maximum length of the sequence to be generated. Between min_length and infinity.
early_stopping	(bool, optional, defaults to True)	If set to True, beam search is stopped when at least num_beams sentences are finished per batch. The default defined in configuration_utils.PretrainedConfig is False.
skip_special_tokens	(bool, optional, defaults to True)	If set to True, special tokens (self.all_special_tokens) are not decoded.
num_samples	(int, optional, defaults to 100)	Number of samples to pull from the dataframe with a specific feature, used to generate each new sample with abstractive summarization.
threshold	(int, optional, defaults to 3500)	Maximum ceiling for each feature, normally the under-sample max.
multiproc	(bool, optional, defaults to True)	If set, stores calls to abstractive summarization in an array which is then passed to run_cpu_tasks_in_parallel, increasing performance through multiprocessing.
debug	(bool, optional, defaults to True)	If set, prints generated summarizations.
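The multiproc pattern described in the table (store the calls first, then run the whole array in parallel) can be illustrated as below. This is a hedged sketch, not the library's run_cpu_tasks_in_parallel: the names summarize and run_tasks_in_parallel are hypothetical, the summarizer is a stand-in, and a thread-backed ThreadPool from the multiprocessing package is swapped in for a process pool so the example runs in any environment.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def summarize(text, max_words):
    """Stand-in for a model call (hypothetical): keep the first max_words words."""
    return " ".join(text.split()[:max_words])


def run_tasks_in_parallel(tasks, workers=4):
    """Run zero-argument callables concurrently; results keep the task order."""
    with ThreadPool(workers) as pool:
        return pool.map(lambda task: task(), tasks)


texts = ["one two three four", "alpha beta gamma", "red green blue yellow"]

# Store the calls instead of executing them immediately...
tasks = [partial(summarize, text, max_words=2) for text in texts]

# ...then hand the whole array to the parallel runner.
summaries = run_tasks_in_parallel(tasks)
```

For CPU-bound model inference a process-based runner would be the closer match; the deferred-call structure stays the same.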

Citation

Please reference this library and the HuggingFace pytorch-transformers library if you use this work in a published or open-source project.

absum 0.2.1
