public-meetings: Python package on PyPI

A collection of public meetings

`public_meetings`: a meeting corpus made of paired transcriptions and reports

The paper is currently under review; please contact pltrdy@gmail.com for more information.

See also https://github.com/pltrdy/autoalign

Getting started

From pip:

pip install public_meetings

Or from source:

^{pr2}$

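About

This corpus contains meetings, each made of (a) an automatic transcription of the audio recording and (b) a meeting report written by a professional.
Both texts are too long to be processed reasonably (e.g. by neural models), so we work on automatic segmentation and alignment to obtain pairs suitable for training and evaluating meeting summarization.

This repository presents a public excerpt of our data. The segmentation/alignment code can be found at https://github.com/pltrdy/autoalign.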

Reading the data

We provide 22 aligned meetings that can be loaded easily:

import public_meetings

meetings = public_meetings.load_meetings()

Meetings are identified by a hash, e.g.:

meeting = meetings['81540075987931464031780e046c0d8f']

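Each meeting is first aligned automatically (meeting['initial']), then post-edited by human annotators (meeting['final']). Each alignment has a transcription side (aka ctm) and a report side (aka doc), both containing segments (typically a few sentences).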

meeting['final']['doc'][i]['text']      # text of the i-th document segment
meeting['final']['doc'][i]['id']        # id of the i-th document segment

meeting['final']['ctm'][j]['text']      # text of the j-th transcription segment
meeting['final']['ctm'][j]['id']        # id of the j-th transcription segment
meeting['final']['ctm'][j]['aligned']   # doc segment id corresponding to the j-th transcription segment
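
As an illustration of how the fields above fit together, here is a minimal sketch that groups transcription segments by the report segment they are aligned to. It assumes that 'aligned' holds a single doc segment id matching a 'id' value on the doc side, and that unaligned segments carry an id with no match (e.g. None); neither detail is stated above. The helper name pair_segments and the toy data are hypothetical and not part of the package.

```python
from collections import defaultdict


def pair_segments(alignment):
    """Return (transcription_text, report_text) pairs for one alignment.

    `alignment` is expected to look like meeting['final']: a dict with a
    'ctm' list and a 'doc' list of segment dicts.
    """
    # Index report segments by id for direct lookup.
    doc_text = {seg["id"]: seg["text"] for seg in alignment["doc"]}

    # Collect transcription texts under the doc id they point to.
    grouped = defaultdict(list)
    for seg in alignment["ctm"]:
        grouped[seg["aligned"]].append(seg["text"])

    # Assumption: ids without a matching doc segment are skipped.
    return [
        (" ".join(texts), doc_text[doc_id])
        for doc_id, texts in grouped.items()
        if doc_id in doc_text
    ]


# Hypothetical toy alignment with the same shape as meeting['final'].
toy_alignment = {
    "doc": [
        {"id": 0, "text": "The chair opens the meeting."},
        {"id": 1, "text": "The budget is approved."},
    ],
    "ctm": [
        {"id": 0, "text": "hello everyone", "aligned": 0},
        {"id": 1, "text": "let us start", "aligned": 0},
        {"id": 2, "text": "uh", "aligned": None},
        {"id": 3, "text": "the budget is approved", "aligned": 1},
    ],
}

pairs = pair_segments(toy_alignment)
for src, tgt in pairs:
    print(src, "=>", tgt)
```

With real data, the same function would be called on meeting['final'] or meeting['initial'].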

Commands

public_meetings provides its utilities through a single entry point:

public_meetings [command]

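The following sections list the commands.

`prepare`: process all meetings into src/tgt files.

The prepare command prepares meetings for summarization models (for training or inference).
It loads each meeting, then writes the transcription side to a [prefix].src.txt file and the report side to a [prefix].tgt.txt file. Several parameters can be set to filter segments by their number of words or sentences (minimum and maximum).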
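
To make the word/sentence filtering concrete, here is a minimal sketch of such a filter. It is not the package's implementation: the function name keep_segment is hypothetical, the bounds are assumed to be inclusive, words are counted by whitespace splitting, and sentences by splitting after '.', '!' or '?', all of which are assumptions.

```python
import re


def keep_segment(text, min_words=10, max_words=1000, min_sents=3, max_sents=50):
    """Return True if `text` falls within the word and sentence bounds."""
    # Assumption: a word is any whitespace-separated token.
    n_words = len(text.split())
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    n_sents = len(sentences)
    return min_words <= n_words <= max_words and min_sents <= n_sents <= max_sents


example = (
    "The council met on Monday. "
    "The budget was discussed at length. "
    "It was approved."
)
print(keep_segment(example))       # 14 words, 3 sentences
print(keep_segment("Too short."))  # 2 words, 1 sentence
```

The default values mirror the -mw, -Mw, -ms and -Ms values used in the example command from the paper.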

Example from the paper:

./prepare.py \
    -mw 10 -Mw 1000 \
    -ms 3 -Ms 50 \
    -overlap_prct 0 -n_draw 0 \
    -remove_unks \
    -sentence_tags \
    -remove_names \
    -remove_headers \
    -remove_p

Full usage:

public_meetings prepare -h
usage: prepare [-h] [-dir DIR] [-mw MW] [-Mw MW] [-ms MS] [-Ms MS]
               [-remove_tags] [-remove_unks] [-remove_names] [-remove_headers]
               [-remove_p] [-sentence_tags] [-overlap_prct OVERLAP_PRCT]
               [-n_draw N_DRAW] [-output OUTPUT] [-verbose]

optional arguments:
  -h, --help            show this help message and exit
  -dir DIR, -d DIR      Aligned meeting root
  -mw MW                Min #words
  -Mw MW                Max #words
  -ms MS                Min #sentences
  -Ms MS                Max #sentences
  -remove_tags          Remove every tags i.e. <*>
  -remove_unks          Remove unknown tags i.e. <unk>
  -remove_names         Remove names i.e. <nom>*</nom>
  -remove_headers       Remove headers i.e. <h>*</h>
  -remove_p             Remove paragraph tags i.e. <p> and </p>
  -sentence_tags        And sentence separators <t> and </t>
  -overlap_prct OVERLAP_PRCT, -oprct OVERLAP_PRCT
  -n_draw N_DRAW
  -output OUTPUT        Output path prefix
  -verbose, -v
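
The tag-related flags above can be pictured with a few regular expressions. The sketch below is only an illustration under assumptions: the function name strip_markup is hypothetical, and it assumes that removing names and headers drops the enclosed text together with the tags, which the help text does not specify.

```python
import re


def strip_markup(text, remove_unks=True, remove_names=True,
                 remove_headers=True, remove_p=True):
    """Remove selected tags from a segment, mirroring the prepare flags."""
    if remove_unks:
        # Drop unknown-word markers.
        text = text.replace("<unk>", " ")
    if remove_names:
        # Assumption: the name itself is removed along with its tags.
        text = re.sub(r"<nom>.*?</nom>", " ", text, flags=re.DOTALL)
    if remove_headers:
        # Assumption: the header text is removed along with its tags.
        text = re.sub(r"<h>.*?</h>", " ", text, flags=re.DOTALL)
    if remove_p:
        # Paragraph tags are removed but their content is kept.
        text = re.sub(r"</?p>", " ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


raw = "<p><h>Agenda</h> <nom>Mr Smith</nom> the budget <unk> is approved</p>"
cleaned = strip_markup(raw)
print(cleaned)
```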

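`segmentation`: process the transcription side for linear segmentation.

We used this command before running linear segmentation experiments. It considers only the transcription side of the meetings and writes it to a source file (one segment per line) and a reference file (one segment per line, plus the segment separator ==========).

You only need to set an output_root directory to receive the text files; optionally, you can choose another meeting_root.

Example: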
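
The source/reference layout described above can be sketched as follows. This is an illustration, not the command's code: the function name to_segmentation_texts is hypothetical, the input is assumed to be already grouped into sections (how the tool derives section boundaries is not stated here), and placing a separator after every section, including the last, is an assumption.

```python
SEPARATOR = "=========="


def to_segmentation_texts(sections):
    """Build (source, reference) file contents from grouped segments.

    `sections` is a list of sections, each a list of segment strings.
    """
    # Source: every segment on its own line, no boundary markers.
    source_lines = [seg for section in sections for seg in section]

    # Reference: same lines, with a separator closing each section.
    reference_lines = []
    for section in sections:
        reference_lines.extend(section)
        reference_lines.append(SEPARATOR)

    return "\n".join(source_lines) + "\n", "\n".join(reference_lines) + "\n"


source, reference = to_segmentation_texts([["a b", "c d"], ["e f"]])
print(source)
print(reference)
```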

public_meetings segmentation -o ./public_meetings_txt

Full usage:

public_meetings segmentation -h
usage: segmentation [-h] [-meeting_root MEETING_ROOT] -output_root OUTPUT_ROOT

optional arguments:
  -h, --help            show this help message and exit
  -meeting_root MEETING_ROOT, -m MEETING_ROOT
                        Meeting root directory
  -output_root OUTPUT_ROOT, -o OUTPUT_ROOT
                        Output root directory

public-meetings 0.1.0rc3
