Collocation-aware tokenization with NLTK

2 votes
1 answer
1086 views
Asked 2025-04-18 05:32

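I am using NLTK and want to tokenize text in a way that respects collocations. For example, "New York" should be treated as a single token rather than split into "New" and "York".

I know how to find collocations, and I know how to tokenize, but I can't work out how to combine the two.

Thanks.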

1 Answer

1

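You could try named entity recognition, which sounds like a good fit for what you want. There are plenty of resources on using NLTK (the Natural Language Toolkit) for named entity recognition. Here is one example: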

from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk


def extract_entities(text):
    entities = []
    for sentence in sent_tokenize(text):
        chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
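        # Named entities are subtrees; plain tokens are (word, tag) tuples.
        # NLTK 3 replaced the old `node` attribute with the `label()` method.
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])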
    return entities


if __name__ == '__main__':
    text = """
A multi-agency manhunt is under way across several states and Mexico after
police say the former Los Angeles police officer suspected in the murders of a
college basketball coach and her fiancé last weekend is following through on
his vow to kill police officers after he opened fire Wednesday night on three
police officers, killing one.
"In this case, we're his target," Sgt. Rudy Lopez from the Corona Police
Department said at a press conference.
The suspect has been identified as Christopher Jordan Dorner, 33, and he is
considered extremely dangerous and armed with multiple weapons, authorities
say. The killings appear to be retribution for his 2009 termination from the
 Los Angeles Police Department for making false statements, authorities say.
Dorner posted an online manifesto that warned, "I will bring unconventional
and asymmetrical warfare to those in LAPD uniform whether on or off duty."
"""
    print(extract_entities(text))

Output:

[Tree('GPE', [('Mexico', 'NNP')]), Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]), Tree('PERSON', [('Rudy', 'NNP')]), Tree('ORGANIZATION', [('Lopez', 'NNP')]), Tree('ORGANIZATION', [('Corona', 'NNP')]), Tree('PERSON', [('Christopher', 'NNP'), ('Jordan', 'NNP'), ('Dorner', 'NNP')]), Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]), Tree('PERSON', [('Dorner', 'NNP')]), Tree('GPE', [('LAPD', 'NNP')])]
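
To get from this output back to a token list, each entity subtree can be collapsed into one token. The following is a minimal sketch rather than part of the original answer: `merge_entities` is a hypothetical helper name, and the sample tree is built by hand in the shape that `ne_chunk` returns, so the snippet runs without downloading any NLTK models.

```python
from nltk.tree import Tree


def merge_entities(chunked_sentence, separator=" "):
    """Flatten an ne_chunk()-style tree into tokens, one token per entity."""
    tokens = []
    for node in chunked_sentence:
        if isinstance(node, Tree):
            # Named-entity subtree: join its words into a single token.
            tokens.append(separator.join(word for word, _tag in node.leaves()))
        else:
            # Ordinary (word, tag) pair: keep just the word.
            tokens.append(node[0])
    return tokens


# Hand-built stand-in for ne_chunk(pos_tag(word_tokenize(sentence))).
sample = Tree('S', [('He', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'),
                    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
                    ('.', '.')])
print(merge_entities(sample))
```

In real use you would pass the result of `ne_chunk(...)` for each sentence instead of `sample`. Note that this only merges spans the chunker recognizes as entities, so the quality of the tokens depends on the quality of the tagging.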

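Another approach is to score word pairs with a measure of how much information two random variables share, such as mutual information, pointwise mutual information (PMI), or the t-test. Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze gives a good introduction; Chapter 5, on collocations, is available for download. The linked example shows how to extract collocations with NLTK.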
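
As a rough illustration of how the scoring and the tokenization fit together, here is a plain-Python sketch that ranks adjacent word pairs by PMI and then merges the top-ranked pairs into single tokens. The function names, the `min_count` threshold, and the sample text are assumptions made for this example, and the probabilities are simple relative-frequency estimates. NLTK's `BigramCollocationFinder` and `MWETokenizer` cover the same two steps if you would rather stay inside the library.

```python
import math
from collections import Counter


def find_collocations(tokens, min_count=2, top_n=10):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # PMI is unreliable for rare pairs
            continue
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with counts / n as estimates
        pmi = math.log2(count * n / (unigrams[w1] * unigrams[w2]))
        scored.append((pmi, (w1, w2)))
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:top_n]]


def merge_collocations(tokens, collocations, separator="_"):
    """Re-tokenize so that each known collocation becomes a single token."""
    known = set(collocations)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2  # skip the second half of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


text = ("I moved to New York last year . New York is loud , "
        "but I like New York more than I expected .")
tokens = text.split()  # stand-in for word_tokenize()
collocations = find_collocations(tokens, min_count=2, top_n=1)
print(collocations)
print(merge_collocations(tokens, collocations))
```

On a real corpus you would tokenize with `word_tokenize`, raise `min_count`, and inspect the ranked pairs before merging, since PMI tends to favor rare pairs.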
