Named entity recognition for search engine queries with Python

Asked 2025-04-14 16:27 by 数据工程师

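I'm doing named entity recognition (NER) on search engine queries in Python.

A key characteristic of search engine queries is that they are often incomplete or written entirely in lowercase.

To deal with this, I've been recommended tools such as spaCy, NLTK, Stanford NLP, Flair, and Hugging Face Transformers.

Does anyone in the StackOverflow community know the best approach to NER on search engine queries? So far I've run into some problems.

For example, with spaCy: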

import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_sm")

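# Queries to process (the original snippet reassigned `text`,
# so only the second query was ever analyzed)
texts = [
    "google and apple are looking at buying u.k. startup for $1 billion",
    "who is barack obama",
]

# Extract and print the entities found in each query
for text in texts:
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)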

For the first query, I get:

google ORG
u.k. GPE
$1 billion MONEY

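That's a good result. However, for the search query "who is barack obama", it returns no entities because the text is lowercase.

I'm sure I'm not the first person to do NER on search engine queries in Python, so I'm hoping someone can point me in the right direction.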

1 Answer

Problem

Most named entity recognition (NER) models rely heavily on capitalization as a primary feature.
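To make that concrete, here is a simplified sketch of a "word shape" feature, a kind of casing signal commonly used in NER. The functions `word_shape` and `shapes` are hypothetical helpers of mine, not how spaCy or any other library actually computes its features. Once the query is lowercased, the capitalization signal disappears entirely.

```python
def word_shape(token: str) -> str:
    """Map each character to a class: X = upper, x = lower, d = digit."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation and symbols as they are
    return "".join(shape)


def shapes(text: str) -> list[str]:
    """Word shapes for a whitespace-tokenized text."""
    return [word_shape(tok) for tok in text.split()]


print(shapes("Who is Barack Obama"))  # ['Xxx', 'xx', 'Xxxxxx', 'Xxxxx']
print(shapes("who is barack obama"))  # ['xxx', 'xx', 'xxxxxx', 'xxxxx']
```

In the lowercase query, "barack" and "obama" have the same shape as any ordinary word, so a model that leans on casing has much less to go on.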

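Solution

I would try a GPT model: these models are trained to predict text from its context, so they should be able to recognize entities from context rather than from capitalization.

I ran a quick experiment with ChatGPT.

Prompt: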

Named entity recognition (NER) is a natural language processing (NLP) method that extracts information from text. NER involves detecting and categorizing important information in text known as named entities. Named entities refer to the key subjects of a piece of text, such as names, locations, companies, events and products, as well as themes, topics, times, monetary values and percentages. You are an expert on recognizing Named entities. 

I will provide short sentences, and you will respond with all the entities you find.

Return the entities classified into four types:

PER for persons such as Bill Clinton, Gauss, Jennifer Lopez
LOC for locations such as California, Europe, 9th Avenue
ORG for organizations such as Apple, Google, UNO
MISC for any other type of entity that does not fit the aforementioned cases.

Respond in JSON format. 

For example:

"google and apple are looking at buying u.k. startup for $1 billion"

response:

{"entities": [
{"name": "google", "type": "ORG"},
{"name": "apple", "type": "ORG"},
{"name": "u.k.", "type": "MISC"}
]}

It responds well for your use case (try it in the ChatGPT app!).

Code

The following code and dependencies should be useful for a first attempt with the OpenAI models.

!pip install openai==1.2.0 pyautogen==0.2.0b2

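(It's currently hard to find a working combination of versions. OpenAI recently migrated to a new API, so existing tutorials are a bit of a mess...)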
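If you want to check which versions you actually have before debugging anything else, a small sketch using only the standard library could look like this. `installed_version` and `PINNED` are names I made up for illustration; the version numbers are simply the ones from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Versions pinned in the install command above.
PINNED = {"openai": "1.2.0", "pyautogen": "0.2.0b2"}

for name, wanted in PINNED.items():
    found = installed_version(name)
    status = "ok" if found == wanted else "mismatch"
    print(f"{name}: wanted {wanted}, found {found} ({status})")
```

A mismatch is not necessarily a problem, but it is worth knowing about when a tutorial written for a different API version fails.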

from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key="<your OpenAI API key>")

# Function to perform Named Entity Recognition (NER)
def perform_ner(text):
    # Define the prompt for NER task
    prompt = """
    
    You are an expert on recognising named entities. I will provide short sentences, and you will respond with all the entities you find. Return the entities classified into four types:
    PER for persons such as Bill Clinton, Gauss, Jennifer Lopez
    LOC for locations such as California, Europe, 9th Avenue
    ORG for organizations such as Apple, Google, UNO
    MISC for any other type of entity that does not fit the aforementioned cases.

    Respond in JSON format. 

    For example:

    "google and apple are looking at buying u.k. startup for $1 billion"

    response:

    {"entities": [
    {"name": "google", "type": "ORG"},
    {"name": "apple", "type": "ORG"},
    {"name": "u.k.", "type": "MISC"}
    ]}
    
    """

    # Generate completion using OpenAI API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text}
        ],
        max_tokens=100,
        n=1,
        stop=None,
        temperature=0
    )

    # Extract and return entities from response
    
    entities = response.choices[0].message.content.strip()
    return json.loads(entities)

# Function to receive new text and return NER JSON
def get_ner_json(new_text):
    # Perform NER on the new text
    entities = perform_ner(new_text)
    return entities

# Example new text
new_text = "I went to Paris last summer and visited the Eiffel Tower."

# Get NER JSON for the new text
ner_json = get_ner_json(new_text)
print(json.dumps(ner_json, indent=2))

Output:

{
  "entities": [
    {
      "name": "paris",
      "type": "LOC"
    },
    {
      "name": "eiffel tower",
      "type": "LOC"
    }
  ]
}
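One caveat: calling `json.loads` directly on the reply can fail if the model wraps the JSON in extra text or if the reply is cut off by `max_tokens`, and nothing guarantees that the returned names really occur in the query. Here is a rough sketch of a more defensive parser. `parse_entities` and `ALLOWED_TYPES` are hypothetical names of mine, not part of the OpenAI library, and the sketch assumes the reply follows the JSON layout requested in the prompt.

```python
import json

ALLOWED_TYPES = {"PER", "LOC", "ORG", "MISC"}


def parse_entities(raw: str, query: str) -> list[dict]:
    """Parse the model's reply and keep only entities found in the query."""
    # Take the outermost {...} so surrounding text or fences are ignored.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return []
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []  # e.g. the reply was truncated

    entities = data.get("entities")
    if not isinstance(entities, list):
        return []

    # Assumes lowercasing keeps the string length (true for ASCII queries).
    lowered = query.lower()
    results = []
    for item in entities:
        if not isinstance(item, dict):
            continue
        name = str(item.get("name", "")).strip()
        label = item.get("type")
        pos = lowered.find(name.lower()) if name else -1
        if isinstance(label, str) and label in ALLOWED_TYPES and pos != -1:
            results.append({
                "name": query[pos:pos + len(name)],  # original casing
                "type": label,
                "start": pos,
                "end": pos + len(name),
            })
    return results


query = "I went to Paris last summer and visited the Eiffel Tower."
reply = (
    '{"entities": [{"name": "paris", "type": "LOC"}, '
    '{"name": "eiffel tower", "type": "LOC"}, '
    '{"name": "london", "type": "LOC"}]}'
)
for ent in parse_entities(reply, query):
    print(ent)
```

The made-up "london" entry is dropped because it does not appear in the query, and the character offsets give you spans comparable to what spaCy's `ent.start_char` and `ent.end_char` provide.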

Answered 2025-04-14 by Python大师
