Idmeneo-cdQa Python package - PyPI

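An end-to-end closed-domain question answering system based on SciBERT

Idmeneo-cdQa: detailed project description

Closed Domain Question Answering

GitHub

An end-to-end closed-domain question answering system, built on top of the Hugging Face transformers library.

cdQA in details

If you are interested in understanding how the system works and how it is implemented, we wrote an article on Medium with a high-level explanation.

We also gave a talk at the 9th NLP Breakfast organized by Feedly. You can check it out here.

Installation

With pip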

pip install cdqa

From source

^{pr2}$

Hardware Requirements

Experiments have been done with:

CPU: AWS EC2 t2.medium, Deep Learning AMI (Ubuntu) Version 22.0
GPU: AWS EC2 p3.2xlarge, Deep Learning AMI (Ubuntu) Version 22.0 + Tesla V100 16GB

Getting started

Preparing your data

Manual

To use `cdQA` you need to create a pandas dataframe with the following columns:

title	paragraphs
The Article Title	[Paragraph 1 of Article, ... , Paragraph N of Article]
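As a minimal sketch of the expected shape, the rows below hold one article each, with `paragraphs` as a list of strings. The helper `check_rows` is hypothetical (it is not part of cdQA) and only verifies the two columns; the article contents are placeholders.

```python
# Hypothetical helper, not part of cdQA: checks that each row has exactly the
# two columns described above, with `paragraphs` as a list of strings.
def check_rows(rows):
    for row in rows:
        if set(row) != {"title", "paragraphs"}:
            return False
        if not isinstance(row["title"], str):
            return False
        if not isinstance(row["paragraphs"], list):
            return False
        if not all(isinstance(p, str) for p in row["paragraphs"]):
            return False
    return True


rows = [
    {
        "title": "The Article Title",
        "paragraphs": ["Paragraph 1 of Article", "Paragraph 2 of Article"],
    },
    {
        "title": "Another Article Title",
        "paragraphs": ["Its only paragraph"],
    },
]

print(check_rows(rows))
```

With pandas installed, `pandas.DataFrame(rows)` turns such a list of dictionaries into a dataframe with the `title` and `paragraphs` columns.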

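With converters

The objective of the cdqa converters is to make it easy to create this dataframe from your raw documents database. For instance, the pdf_converter can create a cdqa dataframe from a directory containing .pdf files:

from cdqa.utils.converters import pdf_converter

df = pdf_converter(directory_path='path_to_pdf_folder')

You will need to install Java OpenJDK to use this converter. We currently have converters for:

pdf
markdown

We plan to improve and add more converters in the future. Stay tuned!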
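To illustrate what a converter does, here is a rough sketch for a folder of plain-text files. The function `text_folder_to_rows` is hypothetical and not part of cdQA; it assumes one article per `.txt` file, takes the file name as the title, and treats blank-line-separated blocks as paragraphs.

```python
import re
from pathlib import Path


def text_folder_to_rows(directory_path):
    """Hypothetical converter (not part of cdQA): one row per .txt file."""
    rows = []
    for path in sorted(Path(directory_path).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Split on blank lines, then collapse whitespace inside each block.
        paragraphs = [
            " ".join(block.split())
            for block in re.split(r"\n\s*\n", text)
            if block.strip()
        ]
        rows.append({"title": path.stem, "paragraphs": paragraphs})
    return rows
```

The returned list has the same `title` / `paragraphs` layout as the dataframe described earlier, so it can be passed to `pandas.DataFrame`.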

Downloading pre-trained models and data

You can download the models and data manually from the GitHub releases or use our download functions:

from cdqa.utils.download import download_squad, download_model, download_bnpp_data

directory = 'path-to-directory'

# Downloading data
download_squad(dir=directory)
download_bnpp_data(dir=directory)

# Downloading pre-trained BERT fine-tuned on SQuAD 1.1
download_model('bert-squad_1.1', dir=directory)

# Downloading pre-trained DistilBERT fine-tuned on SQuAD 1.1
download_model('distilbert-squad_1.1', dir=directory)

Training models

Fit the pipeline on your corpus using the pre-trained reader:

import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline

df = pd.read_csv('your-custom-corpus-here.csv', converters={'paragraphs': literal_eval})
cdqa_pipeline = QAPipeline(reader='bert_qa.joblib')  # use 'distilbert_qa.joblib' for DistilBERT instead of BERT
cdqa_pipeline.fit_retriever(df=df)
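The `literal_eval` converter is needed because a list stored in a CSV cell comes back as a string. The sketch below reproduces that round trip with the standard library only; the row content is a placeholder.

```python
import csv
import io
from ast import literal_eval

# Write one row whose `paragraphs` cell is a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "paragraphs"])
writer.writerow(["The Article Title", ["Paragraph 1", "Paragraph 2"]])

# Reading it back gives the list's string representation, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_cell = row["paragraphs"]

# literal_eval safely parses that string back into a list of strings.
paragraphs = literal_eval(raw_cell)
print(type(raw_cell).__name__, type(paragraphs).__name__)  # prints: str list
```

Passing `literal_eval` through `converters` applies the same parsing to every cell of the `paragraphs` column while the file is read.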

If you want to fine-tune the reader on your custom SQuAD-like annotated dataset:

cdqa_pipeline = QAPipeline(reader='bert_qa.joblib')  # use 'distilbert_qa.joblib' for DistilBERT instead of BERT
cdqa_pipeline.fit_reader('path-to-custom-squad-like-dataset.json')

Save the reader model after fine-tuning:

cdqa_pipeline.dump_reader('path-to-save-bert-reader.joblib')

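Making predictions

To get the best prediction given an input query:

cdqa_pipeline.predict(query='your question')

To get the N best predictions:

cdqa_pipeline.predict(query='your question', n_predictions=N)

You can also change the weight of the retriever score relative to the reader score in the computation of the final ranking score (the default is 0.35, which was shown to be the best weight on the development set of SQuAD 1.1-open):

cdqa_pipeline.predict(query='your question', retriever_score_weight=0.35)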
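One plausible reading of `retriever_score_weight` is a linear interpolation between the two scores. The sketch below assumes exactly that: the function `rank_answers`, the candidate tuples, and the formula itself are illustrative assumptions, not cdQA's confirmed implementation.

```python
def rank_answers(candidates, retriever_score_weight=0.35, n_predictions=1):
    """Rank (answer, retriever_score, reader_score) tuples by a blended score.

    Assumed formula: weight * retriever + (1 - weight) * reader.
    """
    scored = []
    for answer, retriever_score, reader_score in candidates:
        final_score = (
            retriever_score_weight * retriever_score
            + (1 - retriever_score_weight) * reader_score
        )
        scored.append((final_score, answer))
    # Highest blended score first, then keep the requested number of answers.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [answer for _, answer in scored[:n_predictions]]


# Made-up scores: A is favoured by the retriever, B by the reader.
candidates = [
    ("answer A", 0.9, 0.2),
    ("answer B", 0.3, 0.8),
]
print(rank_answers(candidates, retriever_score_weight=0.35))
print(rank_answers(candidates, retriever_score_weight=0.9))
```

Under this assumption, a low weight lets the reader's preferred answer win, while a high weight hands the decision to the retriever.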

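Evaluating models

In order to evaluate models on your custom dataset, you will need to annotate it. The annotation process can be done in 3 steps:

Convert your pandas DataFrame into a json file with SQuAD format:

from cdqa.utils.converters import df2squad

json_data = df2squad(df=df, squad_version='v1.1', output_dir='.', filename='dataset-name')

Use an annotator to add ground truth question-answer pairs:
Please refer to our web-based annotator, which is designed for closed-domain question answering datasets in SQuAD format.

Evaluate the pipeline object:

from cdqa.utils.evaluation import evaluate_pipeline

evaluate_pipeline(cdqa_pipeline, 'path-to-annotated-dataset.json')

Evaluate the reader:

from cdqa.utils.evaluation import evaluate_reader

evaluate_reader(cdqa_pipeline, 'path-to-annotated-dataset.json')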
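For orientation, the sketch below shows what one annotated entry looks like in the SQuAD v1.1 layout, together with the exact-match and token-level F1 scores commonly used with that format. That the cdQA evaluation functions report these particular metrics is an assumption here, and the example content and the function names are illustrative only.

```python
import re
import string
from collections import Counter

# A minimal annotated dataset in the SQuAD v1.1 layout (illustrative content).
dataset = {
    "version": "v1.1",
    "data": [{
        "title": "The Article Title",
        "paragraphs": [{
            "context": "cdQA is released under the Apache-2.0 license.",
            "qas": [{
                "id": "q1",
                "question": "Under which license is cdQA released?",
                # answer_start is the character offset of the answer in context.
                "answers": [{"text": "Apache-2.0", "answer_start": 27}],
            }],
        }],
    }],
}


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return normalize(prediction) == normalize(truth)


def f1(prediction, truth):
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # Multiset intersection counts the tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


truth = dataset["data"][0]["paragraphs"][0]["qas"][0]["answers"][0]["text"]
print(exact_match("the Apache-2.0 license", truth))
print(round(f1("the Apache-2.0 license", truth), 3))
```

A prediction that contains the right answer plus extra words fails exact match but still earns partial F1 credit, which is why both scores are usually read together.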


Notebook Examples

We prepared some notebook examples under the examples directory.

You can also play directly with these notebook examples using Binder or Google Colaboratory:

^{tb2}$

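Binder and Google Colaboratory provide temporary environments and may be slow to start, but we recommend them if you want to get started with cdQA easily.

Deployment

Manual

You can deploy a cdQA REST API by executing:

export dataset_path=path-to-dataset.csv
export reader_path=path-to-reader-model

FLASK_APP=api.py flask run -h 0.0.0.0

You can now make requests to test your API (here using HTTPie):

http localhost:5000/api query=='your question here'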
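The same request can be sketched in Python with the standard library. In HTTPie, `==` sends `query` as a URL query-string parameter, so the call above is a GET on `/api?query=...`. The helper names below are hypothetical, and the assumption that the API answers with JSON is not confirmed by this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(query, host="localhost", port=5000):
    # Mirrors HTTPie's `query==...`: the question goes in the query string.
    return f"http://{host}:{port}/api?{urlencode({'query': query})}"


def ask(query):
    # Needs the API to be running; the response is assumed to be JSON.
    with urlopen(build_url(query)) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_url("your question here"))
# prints: http://localhost:5000/api?query=your+question+here
```

Calling `ask('your question here')` only works once the Flask server from the previous step is up.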

If you wish to serve a user interface on top of your cdQA system, follow the instructions of cdQA-ui, a web interface developed for cdQA.

Contributing

Read our Contributing Guidelines.

References

Type	Title	Author	Year
:video_camera: Video	Stanford CS224N: NLP with Deep Learning Lecture 10 – Question Answering	Christopher Manning	2019
:newspaper: Paper	Reading Wikipedia to Answer Open-Domain Questions	Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes	2017
:newspaper: Paper	Neural Reading Comprehension and Beyond	Danqi Chen	2018
:newspaper: Paper	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova	2018
:newspaper: Paper	Contextual Word Representations: A Contextual Introduction	Noah A. Smith	2019
:newspaper: Paper	End-to-End Open-Domain Question Answering with BERTserini	Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin	2019
:newspaper: Paper	Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering	Wei Yang, Yuqing Xie, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin	2019
:newspaper: Paper	Passage Re-ranking with BERT	Rodrigo Nogueira, Kyunghyun Cho	2019
:newspaper: Paper	MRQA: Machine Reading for Question Answering	Jonathan Berant, Percy Liang, Luke Zettlemoyer	2019
:newspaper: Paper	Unsupervised Question Answering by Cloze Translation	Patrick Lewis, Ludovic Denoyer, Sebastian Riedel	2019
:computer: Framework	Scikit-learn: Machine Learning in Python	Pedregosa et al.	2011
:computer: Framework	PyTorch	Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan	2016
:computer: Framework	Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.	Hugging Face	2018

License

Apache-2.0


Idmeneo-cdQa 0.0
