How do I convert CoNLL format to a list of sentences?

3 answers

User

Answer 1 · edited 2024-05-14 04:03:22

The simplest approach is to iterate over the lines of the file and take the first column. No imports are needed.

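result = [[]]
with open(YOUR_FILE, "r") as f:
    for line in f:
        # Skip CoNLL comment lines
        if line.startswith("#"):
            continue
        if line.strip() == "":
            # A blank line ends the sentence; open a new one unless the current one is still empty
            if result[-1]:
                result.append([])
        else:
            # Keep only the first column (the token)
            result[-1].append(line.split()[0])
# Drop the empty list left behind by a trailing blank line, then join the tokens
result = [" ".join(row) for row in result if row]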

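In my experience, writing this by hand is the most effective approach: CoNLL formats vary a lot (though usually in trivial ways, such as the order of the columns), and you don't want to wrestle with someone else's code for a problem that is this easy to solve. For example, the code cited by @markusodenthal keeps the CoNLL comments (lines starting with #), which is probably not what you want.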
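As a rough illustration of the column-order point, the same loop could take the column index as a parameter. This is only a sketch: the function name `conll_to_sentences` and the sample lines are made up for the example, and it assumes whitespace-separated columns with the word in column 1, as in CoNLL-U style files.

```python
def conll_to_sentences(lines, column=0):
    """Join the chosen column of each token line into one string per sentence."""
    sentences, current = [], []
    for line in lines:
        if line.startswith("#"):      # CoNLL comment line
            continue
        fields = line.split()
        if not fields:                # blank line: sentence boundary
            if current:
                sentences.append(" ".join(current))
                current = []
        else:
            current.append(fields[column])
    if current:                       # file did not end with a blank line
        sentences.append(" ".join(current))
    return sentences

# Hypothetical sample where column 0 is the token index and column 1 the word
sample = [
    "# text = Hello world\n",
    "1\tHello\thello\tINTJ\n",
    "2\tworld\tworld\tNOUN\n",
    "\n",
    "1\tBye\tbye\tINTJ\n",
]
print(conll_to_sentences(sample, column=1))
```

With `column=0` it behaves like the loop above; only the index changes between format variants.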

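Another point is that writing the loop yourself lets you process the data sentence by sentence instead of first reading everything into a list. If you don't need the whole file at once, this is faster and scales better.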
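One way to get that sentence-by-sentence behaviour is a generator. The sketch below assumes the token is in the first column and uses an in-memory `io.StringIO` as a stand-in for an open file; the name `iter_sentences` and the demo text are invented for illustration.

```python
import io

def iter_sentences(lines):
    """Yield one sentence string at a time from an iterable of CoNLL lines."""
    tokens = []
    for line in lines:
        if line.startswith("#"):      # skip CoNLL comments
            continue
        fields = line.split()
        if fields:
            tokens.append(fields[0])  # first column holds the token
        elif tokens:                  # blank line closes the current sentence
            yield " ".join(tokens)
            tokens = []
    if tokens:                        # last sentence without a trailing blank line
        yield " ".join(tokens)

# Stand-in for an open file object; a real file handle can be passed the same way
demo = io.StringIO("The DET\ncat NOUN\n\nIt PRON\nsleeps VERB\n")
for sentence in iter_sentences(demo):
    print(sentence)
```

Because nothing is accumulated beyond the current sentence, memory use stays flat however large the file is.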

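User

Answer 2 · edited 2024-05-14 04:03:22

For NLP problems, my first port of call is always Hugging Face :-D There is a good example for your problem: https://huggingface.co/transformers/custom_datasets.html

There they show a function that does exactly what you want: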

from pathlib import Path
import re

def read_wnut(file_path):
    file_path = Path(file_path)

    # Read the whole file; sentences are separated by blank lines
    raw_text = file_path.read_text().strip()
    raw_docs = re.split(r'\n\t?\n', raw_text)
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            # Each line holds exactly one token and one tag, separated by a tab
            token, tag = line.split('\t')
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)

    # Two parallel lists: tokens per sentence and tags per sentence
    return token_docs, tag_docs

texts, tags = read_wnut("location/train_data.txt")
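Note that this function returns lists of tokens rather than sentence strings. If plain sentences are wanted, joining each token list is probably enough. The snippet below is a self-contained sketch that applies the same split pattern to a made-up two-sentence sample instead of a real file.

```python
import re

# Made-up sample in the token<TAB>tag layout the function above expects
raw_text = "EU\tB-ORG\nrejects\tO\n\nPeter\tB-PER\nBlackburn\tI-PER"

# Same blank-line split as in read_wnut, keeping only the token column
token_docs = [
    [line.split("\t")[0] for line in doc.split("\n")]
    for doc in re.split(r"\n\t?\n", raw_text.strip())
]

# One string per sentence
sentences = [" ".join(tokens) for tokens in token_docs]
print(sentences)
```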

User

Answer 3 · edited 2024-05-14 04:03:22

You can use the conllu library.

Install it with pip install conllu.

An example use case is shown below:

>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

"""
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>]
