Extracting person and organization names from text with Python

Asked 2025-04-17 22:35

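I'm curious what accurate methods exist for extracting person and organization names from text. I want to draw a relationship network based on information such as who collaborated with whom.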

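I have tried a few approaches:
• Part-of-speech (POS) tagging with nltk: this was too slow, so I gave up on it.
• A regular expression that matches runs of consecutive words and checks whether each one starts with a capital letter. This produced many exceptions and false matches, and a lot of the results were irrelevant (for example, someone had arbitrarily capitalized a word in "Social Innovation Award"). It also missed names that consist of a single word.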
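For reference, the capitalized-run heuristic from the second bullet might look roughly like the sketch below. The pattern and the sample sentence are illustrative assumptions, not the exact code that was tried; the sentence is loosely adapted from the sample text.

```python
import re

# Assumed pattern: two or more consecutive words, each starting with a
# capital letter followed by lowercase letters, separated by single spaces.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

# Illustrative sentence, not taken verbatim from the data.
sample = "Ed Greenspon spoke at the Social Entrepreneurship Summit in Toronto."

candidates = CAPITALIZED_RUN.findall(sample)
print(candidates)
```

On this sentence the pattern picks up the event name alongside the person and skips the single-word "Toronto", which are the two failure modes described above.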

Does anyone have a better idea?

Sample text:

obin Cardozo\r\n\r\nEd Greenspon\r\n\r\nFarouk Jiwa\r\n\r\nDavid Pecaut\r\n\r\nMartha
Piper\r\n\r\nThe award was presented during the closing dinner of the Social
Entrepreneurship\r\nSummit held at MaRS Centre for Social Innovation in Toronto. The event
gathered\r\nover 250 business, academic and social thought leaders from the
social\r\nentrepreneurship sector in Canada who had convened for a full day of
inspiration\r\nand engagement on ways to address some of the most pressing issues of our
times.\r\n\r\nAn often under-recognized community, social entrepreneurs create and lead
an\r\norganization that are aimed at catalyzing systemic social change through new\r\nideas,
products, services, methodologies and changes in attitude.\r\n\r\nHosted in partnership by
MaRS Centre, The Boston Consulting Group (BCG), the\r\nCentre for Social Innovation and the Toronto City Summit Alliance, the Social\r\nEntrepreneurship Summit

1 Answer


First, clean up your data:

>>> text = """obin Cardozo\r\n\r\nEd Greenspon\r\n\r\nFarouk Jiwa\r\n\r\nDavid Pecaut\r\n\r\nMartha Piper\r\n\r\nThe award was presented during the closing dinner of the Social Entrepreneurship\r\nSummit held at MaRS Centre for Social Innovation in Toronto. The event gathered\r\nover 250 business, academic and social thought leaders from the social\r\nentrepreneurship sector in Canada who had convened for a full day of inspiration\r\nand engagement on ways to address some of the most pressing issues of our times.\r\n\r\nAn often under-recognized community, social entrepreneurs create and lead an\r\norganization that are aimed at catalyzing systemic social change through new\r\nideas, products, services, methodologies and changes in attitude.\r\n\r\nHosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), the\r\nCentre for Social Innovation and the Toronto City Summit Alliance, the Social\r\nEntrepreneurship Summit"""
>>> # Split on blank lines, then turn the remaining hard line breaks into spaces
>>> text = [i.replace('\r\n', ' ').strip() for i in text.split('\r\n\r\n')]
>>> text
['obin Cardozo', 'Ed Greenspon', 'Farouk Jiwa', 'David Pecaut', 'Martha Piper', 'The award was presented during the closing dinner of the Social Entrepreneurship Summit held at MaRS Centre for Social Innovation in Toronto. The event gathered over 250 business, academic and social thought leaders from the social entrepreneurship sector in Canada who had convened for a full day of inspiration and engagement on ways to address some of the most pressing issues of our times.', 'An often under-recognized community, social entrepreneurs create and lead an organization that are aimed at catalyzing systemic social change through new ideas, products, services, methodologies and changes in attitude.', 'Hosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), the Centre for Social Innovation and the Toronto City Summit Alliance, the Social Entrepreneurship Summit']

Next, you need a full named-entity recognizer. Try NLTK's ne_chunk first, and move on to a more advanced NER tagger afterwards:

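# Requires NLTK's tokenizer, POS tagger and NE chunker data (fetch with nltk.download)
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.tree import Tree
# batch_ne_chunk was renamed to ne_chunk_sents in NLTK 3
from nltk import ne_chunk_sents
# Per paragraph: split into sentences, tokenize and POS-tag each, then chunk named entities
chunked_text = [list(ne_chunk_sents(pos_tag(word_tokenize(j)) for j in sent_tokenize(i))) for i in text]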
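Once you have entity names for each paragraph, one simple way to get toward the relationship network you mention is to count which names appear together. The sketch below is only an illustration under assumptions: `cooccurrence_edges` is a hypothetical helper, and the entity lists are written by hand from the sample text rather than produced by ne_chunk, whose real output will differ.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_per_paragraph):
    """Count how often each pair of entity names shares a paragraph."""
    edges = Counter()
    for entities in entities_per_paragraph:
        # Deduplicate and sort so every pair has one canonical order.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hand-written stand-in for NER output (an assumption, not ne_chunk results).
entities_per_paragraph = [
    ["MaRS Centre", "Toronto"],
    ["MaRS Centre", "Boston Consulting Group", "Toronto City Summit Alliance"],
]

edges = cooccurrence_edges(entities_per_paragraph)
for (a, b), weight in edges.most_common():
    print(a, "<->", b, weight)
```

Each key is a pair of names and each value is an edge weight, so the result can be passed to whatever graph or plotting library you prefer.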
