piidetect Python package - PyPI

piidetect

A package for building end-to-end ML pipelines that detect personally identifiable information (PII) in text. This package is still in early development; more documentation and tests are coming.

Installation

pip install piidetect

Creating fake PII

fakepii.py is the module that creates random text mixed with different types of PII.

Usage in Python

Create fake text in Python:

from piidetect.fakepii import Fake_PII
fake_ = Fake_PII()
fake_.create_fake_profile(10)
train_labels, train_text, train_PII = fake_.create_pii_text_train(n_text = 5)

The package also provides helper functions that create text containing fake PII and dump it to disk.

from piidetect.fakepii import Fake_PII, write_to_disk_train, write_to_disk_test

write_to_disk_train(10)
write_to_disk_test(20)

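The training data file is named "train_text_with_pii_" + convert_datetime_underscore(datetime.now()) + ".csv". The testing data file is named "test_text_with_pii_" + convert_datetime_underscore(datetime.now()) + ".csv".

The dumped data contains three columns: "Text", "Labels", and "PII". The Text column contains the text mixed with PII. The Labels column contains the type of PII in the text; if the text contains no PII, the value is "None". The PII column contains the actual PII.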
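To make this layout concrete, the sketch below parses a tiny in-memory CSV with the three columns using Python's standard csv module. The sample rows, the label name Phone_number, and the quoting style are invented for illustration; the exact formatting of the real dump may differ.

```python
import csv
import io

# Hypothetical sample mimicking the three-column layout described above.
# The rows are invented; real files come from write_to_disk_train/test.
SAMPLE_CSV = """Text,Labels,PII
"Please call me at 555-0100 tomorrow.",Phone_number,555-0100
"The meeting was moved to next week.",None,None
"""

def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))

rows = load_rows(SAMPLE_CSV)
# Keep only the rows whose label says that PII is present.
rows_with_pii = [row for row in rows if row["Labels"] != "None"]

for row in rows_with_pii:
    print(row["Labels"], "->", row["PII"])
```

Filtering on the Labels column, rather than on the PII column, keeps the check aligned with the value the classifier is later trained on.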

Command-line usage

You can download fakepii.py to a local directory and use it from the command line. Here are some examples.

# Create 1000 training texts and 100 testing texts
python fakepii.py -train 1000 -test 100
# Create 100 testing texts
python fakepii.py -test 100
# Create 1000 training texts
python fakepii.py -train 1000
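For readers curious how flags of this shape can be handled, here is a minimal sketch using the standard argparse module. It only mirrors the -train and -test options shown in the commands; the function names build_parser and parse_counts are hypothetical, and the script's real argument handling may be different.

```python
import argparse

def build_parser():
    """Build a parser with single-dash options like those in the commands."""
    parser = argparse.ArgumentParser(description="Generate fake PII text (sketch).")
    # argparse accepts single-dash multi-letter option names such as -train.
    parser.add_argument("-train", type=int, default=0,
                        help="number of training texts to create")
    parser.add_argument("-test", type=int, default=0,
                        help="number of testing texts to create")
    return parser

def parse_counts(argv):
    """Return (n_train, n_test) parsed from a list of argument strings."""
    args = build_parser().parse_args(argv)
    return args.train, args.test

print(parse_counts(["-train", "1000", "-test", "100"]))
print(parse_counts(["-test", "100"]))
```

Defaulting both counts to 0 lets either flag be omitted, which matches the three invocation styles shown.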

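In the training text, the same base text is reused with different PII inserted into it. In the testing text, the base text is not deliberately repeated to insert different PII.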
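The idea of reusing one base text with different PII can be sketched as follows. This is not the package's generator: the base sentences, PII values, label names, and the helper names insert_pii and make_training_rows are all made up for illustration.

```python
import random

# Hypothetical base sentences and PII values, invented for this sketch.
BASE_TEXTS = ["I will be out of office next week.", "Thanks for the quick reply."]
FAKE_PII = {
    "Email": ["alice@example.com", "bob@example.com"],
    "Phone_number": ["555-0100", "555-0199"],
}

def insert_pii(text, pii, rng):
    """Insert a PII string at a random word boundary of the text."""
    words = text.split()
    position = rng.randint(0, len(words))
    return " ".join(words[:position] + [pii] + words[position:])

def make_training_rows(base_texts, fake_pii, rng):
    """Reuse every base text once per PII type, plus one PII-free copy."""
    rows = []
    for text in base_texts:
        rows.append((text, "None", "None"))
        for label, values in fake_pii.items():
            pii = rng.choice(values)
            rows.append((insert_pii(text, pii, rng), label, pii))
    return rows

rng = random.Random(0)
for row in make_training_rows(BASE_TEXTS, FAKE_PII, rng):
    print(row)
```

Because each base sentence appears both with and without PII, a model trained on such rows has to rely on the inserted token rather than on the surrounding words.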

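Word embedding training

This package wraps the word embedding algorithms word2vec, doc2vec, and fastText for detecting PII.

The word embedding class accepts a pre-trained model, which is passed through the pre_trained option at class initialization.

After training, the word2vec model is dumped to the path given in the corresponding dump option. The dump fails if the directory does not exist.

If pre_train is None, a new model is trained.

If pre_train is not None, the default is to continue training the pre-trained model on the new text, unless the option continue_train_pre_train is set to False. With False, the pre-trained model is used as-is, without being trained on the text.

If re_train_new_sentences is True (the default), the model is retrained on new sentences. This creates word embeddings for words that are not in the original vocabulary. It also increases inference time, because inference then involves model training.

To predict PII with word2vec, updating the model with the new sentences is recommended. For fastText this is unnecessary, because it infers vectors from character n-grams. Training fastText takes much longer than training word2vec.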
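The difference can be illustrated with a small sketch of character n-grams. fastText represents a word by its character n-grams (by default of length 3 to 6, with < and > marking the word boundaries), so an unseen token still shares n-grams with tokens seen in training. The code below only computes and compares n-gram sets; it does not train or query any embedding model, and the function names are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word, with < and > as boundary markers."""
    marked = "<" + word + ">"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            ngrams.add(marked[start:start + n])
    return ngrams

def ngram_overlap(word_a, word_b):
    """Jaccard similarity between the n-gram sets of two words."""
    a, b = char_ngrams(word_a), char_ngrams(word_b)
    return len(a & b) / len(a | b)

seen = "alice@example.com"      # imagine this token appeared in the training text
unseen = "carol@example.com"    # a new token that only appears at prediction time
unrelated = "tomorrow"

print(ngram_overlap(seen, unseen))     # shares the "@example.com" n-grams
print(ngram_overlap(seen, unrelated))  # shares no n-grams
```

A word2vec model has no such fallback: a token outside its vocabulary has no vector at all, which is why retraining on new sentences matters there.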

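size: Dimensionality of the word vectors. Must match that of the specified pre-trained model.

min_count: Ignores all words with a total frequency lower than this. Use 1 for PII detection.

workers: Number of CPU cores used for training.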
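The reason for recommending a min_count of 1 can be seen in a small sketch of frequency-based vocabulary filtering. PII tokens such as email addresses usually occur only once, so any higher threshold drops them. The helper build_vocabulary and the sample sentences are hypothetical and only imitate the filtering rule described above.

```python
from collections import Counter

def build_vocabulary(sentences, min_count):
    """Keep only tokens whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return {token for token, count in counts.items() if count >= min_count}

sentences = [
    "please email me at alice@example.com",
    "please call me at 555-0100",
]

vocab_1 = build_vocabulary(sentences, min_count=1)
vocab_2 = build_vocabulary(sentences, min_count=2)

print("alice@example.com" in vocab_1)  # True: every token is kept
print("alice@example.com" in vocab_2)  # False: the address occurs only once
print(sorted(vocab_2))                 # ['at', 'me', 'please']
```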

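from piidetect.pipeline import word_embedding

# data is assumed to be a pandas DataFrame with a "Text" column,
# for example one loaded from the CSV dumped earlier
model = word_embedding(algo_name="word2vec", size=100, min_count=1, workers=2)
model.fit(data['Text'])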

How to build a PII detection pipeline with piidetect

Before training an end-to-end PII detector, create binary labels for the ML model.

from piidetect.pipeline import binary_pii
data['Target'] = data['Labels'].apply(binary_pii)
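Conceptually, the binary target collapses every PII type into a single positive class. The sketch below shows one way such a mapping could look. It is an assumption about the behavior, not the package's binary_pii implementation, and the name to_binary_label and the sample labels are hypothetical.

```python
def to_binary_label(label):
    """Map a PII type label to 1 (text contains PII) or 0 (no PII)."""
    # Treat both the string "None" and a real None as "no PII".
    return 0 if label is None or str(label) == "None" else 1

labels = ["Email", "None", "Phone_number", None]
print([to_binary_label(label) for label in labels])  # [1, 0, 1, 0]
```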

Here is an example of building end-to-end PII detection with logistic regression.

from piidetect.pipeline import word_embedding, text_clean
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

logit_clf_word2vec = LogisticRegression(solver = "lbfgs", max_iter = 10000)

word2vec_pipe = Pipeline([('text_cleaning', text_clean()),
                 ("word_embedding", word_embedding(algo_name = "word2vec", workers =2)),
                 ("logit_clf_word2vec",logit_clf_word2vec)
                ])

word2vec_pipe.fit(data["Text"],data['Target'] )

You can also use RandomizedSearchCV to select hyperparameters. (This will run for a long time.)

from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from piidetect.pipeline import word_embedding, text_clean

logit_clf_word2vec = LogisticRegression(solver="lbfgs", max_iter=10000)

pipe = Pipeline([('text_cleaning', text_clean()),
                 ("word_embedding", word_embedding(workers=2)),
                 ("logit_clf_word2vec", logit_clf_word2vec)
                ])

# Search space: list entries are sampled at random, C is drawn from uniform(0, 10)
param_grid = {
    'word_embedding__algo_name': ['word2vec', 'doc2vec', 'fasttext'],
    'word_embedding__size': [100, 200, 300],
    'logit_clf_word2vec__C': uniform(0, 10),
    'logit_clf_word2vec__class_weight': [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, {0: 0.7, 1: 0.3}, None]
}

pipe_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid,
                             cv=10, error_score=0, n_iter=10, scoring='f1',
                             return_train_score=True, n_jobs=1)

# Fit the search so that best_estimator_ becomes available
pipe_cv.fit(data["Text"], data['Target'])

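After training, the pipeline can be dumped to disk. compress=1 saves the pipeline into a single file. For a pipeline that uses word2vec with size=300, the file can be about 1 GB.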

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(pipe_cv.best_estimator_, 'pipe_cv.pkl', compress=1)

piidetect 0.0.0.2

piidetect的Python项目详细描述

piidetect

安装

创建假PII

在python中使用

命令行用法

单词嵌入训练

如何使用piidetect构建pii检测管道。

推荐PyPI第三方库

parle

libzt

smugp

obspyh5

sentry-slack

django-yabackup

django-bittersweet

shinkenplugins.plugins.drupal_extensions

pynotice

jquer

django-laporem-field

django-nomad-country-blogs

talke

inspire-matcher

sseclient-p

导航栏

项目链接

标签

维护者

最新PyPI项目

最新Python常见问题

piidetect 0.0.0.2

piidetect的Python项目详细描述

piidetect

安装

创建假PII

在python中使用

命令行用法

单词嵌入训练

如何使用piidetect构建pii检测管道。

推荐PyPI第三方库

parle

libzt

smugp

obspyh5

sentry-slack

django-yabackup

django-bittersweet

shinkenplugins.plugins.drupal_extensions

pynotice

jquer

django-laporem-field

django-nomad-country-blogs

talke

inspire-matcher

sseclient-p

导 航 栏

项目 链接

标 签

维护者

最新PyPI项目

最新Python常见问题

导航栏

项目链接

标签