Posted 2024-06-11 11:35:39
User
I'm looking for a tokenizer that expands contractions.
When I use nltk to split a phrase into tokens, the contraction is not expanded:
nltk.word_tokenize("she's") -> ['she', "'s"]
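However, a dictionary that maps contractions to their expansions ignores the information carried by the surrounding words, so it cannot decide whether "she's" should map to "she is" or "she has".
Is there a tokenizer that also expands contractions?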
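To make the limitation concrete, here is a minimal sketch of a dictionary-only expander. The `CONTRACTIONS` table and the `expand_with_dict` helper are hypothetical names used for illustration, not part of nltk or any other library.

```python
# Hypothetical context-free lookup table (illustration only).
CONTRACTIONS = {
    "she's": "she is",  # could just as well be "she has"
    "he's": "he is",
    "it's": "it is",
}

def expand_with_dict(text):
    """Replace every whitespace-separated word found in CONTRACTIONS."""
    return " ".join(CONTRACTIONS.get(w.lower(), w) for w in text.split())

print(expand_with_dict("she's a software engineer"))  # she is a software engineer
print(expand_with_dict("she's got a cat"))            # she is got a cat (wrong)
```

Whichever expansion the table stores, one of the two sentences comes out wrong, because the lookup never sees the next word.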
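You can use spaCy's rule-based matching to take the information provided by the surrounding words into account. I wrote some demo code below, which you can extend to cover more cases: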
import spacy
from spacy.matcher import Matcher

sentences = [
    "now she's a software engineer",
    "she's got a cat",
    "he's a tennis player",
    "He thinks that she's 30 years old",
]

nlp = spacy.load("en_core_web_sm")

# Build the matcher once. This is the spaCy v3 signature;
# spaCy v2 used matcher.add(key, None, pattern) instead.
matcher = Matcher(nlp.vocab)
matcher.add("case_has", [
    [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "got"}],
    [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "been"}],
])
matcher.add("case_is", [
    [{"POS": "PRON"}, {"LOWER": "'s"}, {"POS": "DET"}],
    [{"POS": "PRON"}, {"LOWER": "'s"}, {"IS_DIGIT": True}],
])
# .. add more cases

EXPANSIONS = {"case_has": "has", "case_is": "is"}

def normalize(sentence):
    doc = nlp(sentence)
    # Map the index of each matched "'s" token to its expansion.
    # Every pattern puts "'s" in second position, hence start + 1.
    replacements = {}
    for match_id, start, end in matcher(doc):
        replacements[start + 1] = EXPANSIONS[nlp.vocab.strings[match_id]]
    # Tokens without a replacement are kept unchanged, so a sentence
    # with no match is returned as-is instead of as an empty string.
    return " ".join(replacements.get(t.i, t.text) for t in doc)

for s in sentences:
    print(s)
    print(normalize(s))
    print()
Output:
now she's a software engineer
now she is a software engineer

she's got a cat
she has got a cat

he's a tennis player
he is a tennis player

He thinks that she's 30 years old
He thinks that she is 30 years old
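The same lookahead idea can be sketched without spaCy, on a token list shaped like the `['she', "'s"]` output of `nltk.word_tokenize` shown in the question. This is only a rough sketch under stated assumptions: `PRONOUNS`, `HAS_TRIGGERS` and `expand_tokens` are hypothetical names, the word sets are deliberately tiny, and unlike the Matcher patterns above it falls back to "is" whenever no "has" cue follows.

```python
PRONOUNS = {"he", "she", "it"}   # assumed subject set, not exhaustive
HAS_TRIGGERS = {"got", "been"}   # same cues as the "case_has" patterns

def expand_tokens(tokens):
    """Expand "'s" in a token list by looking at the neighbouring tokens."""
    out = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1].lower() if i > 0 else ""
        next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if tok.lower() == "'s" and prev_tok in PRONOUNS:
            # A following "got"/"been" signals "has"; otherwise assume "is".
            out.append("has" if next_tok in HAS_TRIGGERS else "is")
        else:
            # Everything else, including possessives like John 's, is kept.
            out.append(tok)
    return out

print(expand_tokens(["she", "'s", "got", "a", "cat"]))
print(expand_tokens(["John", "'s", "cat"]))
```

Because it works on plain lists, a function like this could be run directly on tokenizer output, though real text would need larger word sets and probably part-of-speech information as in the spaCy version.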