How to tag sentences for spaCy's sense2vec implementation

Posted 2021-05-13 15:04:43


spaCy has implemented a sense2vec word embedding package, which they document here

The vectors are all keyed in the form WORD|POS. For example, the sentence

Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of trouble

needs to be converted into

Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT ...

so that it is in sense2vec format and can be interpreted by the pretrained sense2vec embeddings.

How can this be done?

1 Answer

Based on spaCy's bin/merge.py implementation, which does exactly what is needed:

# Written against the legacy spaCy API (spacy.en, Span.merge).
from spacy.en import English
import re

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'ENT',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}



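# Pipeline is created lazily and reused, so the model loads only once.
nlp = None
def tag_words_in_sense2vec_format(passage):
    """Return the passage as WORD|TAG tokens, one sentence per line."""
    global nlp
    if nlp is None:
        nlp = English()
    # Decode raw bytes to text before handing them to spaCy.
    if isinstance(passage, bytes):
        passage = passage.decode('utf-8', errors='ignore')
    doc = nlp(passage)
    return transform_doc(doc)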

def transform_doc(doc):
    # Merge each named entity into a single token tagged with its LABELS entry.
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    # Noun-chunk merging is left disabled:
    #for np in doc.noun_chunks:
    #    while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
    #        np = np[1:]
    #    np.merge(np.root.tag_, np.text, np.root.ent_type_)
    # Emit one line per non-empty sentence, skipping whitespace tokens.
    strings = []
    for sent in doc.sents:
        if sent.text.strip():
            strings.append(' '.join(represent_word(w) for w in sent if not w.is_space))
    if strings:
        return '\n'.join(strings) + '\n'
    return ''


def represent_word(word):
    # URLs collapse to a single placeholder token.
    if word.like_url:
        return '%%URL|X'
    # Merged entities may contain whitespace; join their parts with underscores.
    text = re.sub(r'\s', '_', word.text)
    # Prefer the mapped entity label, otherwise fall back to the coarse POS tag.
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag
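
To see how the tag is chosen without loading a spaCy model, the logic of represent_word can be tried on stand-in tokens. This is only an illustrative sketch: FakeToken is a hypothetical namedtuple that mimics the four token attributes the function reads (text, pos_, ent_type_, like_url), and LABELS is trimmed to three entries.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; only the attributes that
# represent_word reads are modelled here.
FakeToken = namedtuple('FakeToken', ['text', 'pos_', 'ent_type_', 'like_url'])

# Trimmed copy of the LABELS mapping, for illustration only.
LABELS = {'PERSON': 'ENT', 'GPE': 'ENT', 'DATE': 'DATE'}


def represent_word(word):
    if word.like_url:
        return '%%URL|X'
    text = re.sub(r'\s', '_', word.text)
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag


tokens = [
    FakeToken('newspaper', 'NOUN', '', False),       # plain word: POS tag is used
    FakeToken('New York', 'PROPN', 'GPE', False),    # merged entity: mapped to ENT
    FakeToken('http://example.com', 'X', '', True),  # URL: fixed placeholder
    FakeToken('foo', '', '', False),                 # no tag available: '?'
]
print(' '.join(represent_word(t) for t in tokens))
# newspaper|NOUN New_York|ENT %%URL|X foo|?
```

The entity label takes priority over the part of speech, which is why a merged multi-word entity ends up as a single underscore-joined token tagged ENT.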

Calling tag_words_in_sense2vec_format on the example sentence from the question results in

 Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT ...
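
Since every token in the output has the form WORD|TAG, a line can be split back into pairs when the result needs checking. A minimal sketch, assuming the tag itself never contains a '|' character; parse_sense2vec_line is a hypothetical helper, not part of spaCy or sense2vec.

```python
def parse_sense2vec_line(line):
    """Split a line of WORD|TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last '|' so a word that is itself a pipe still parses.
        word, _, tag = token.rpartition('|')
        pairs.append((word, tag))
    return pairs


line = 'Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT'
print(parse_sense2vec_line(line))
# [('Dear', 'ADJ'), ('local', 'ADJ'), ('newspaper', 'NOUN'), (',', 'PUNCT')]
```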
