Python NLTK: interpret a fixed sentence pattern and tokenize it

2 answers

User

Answer 1 · edited 2024-05-15 11:58:18

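In computational linguistics this is called "Named Entity Recognition": the process of identifying things such as organizations, people, and locations in text.

The challenge here is that the default NE chunker in NLTK is a maximum entropy chunker trained on the ACE corpus. It has not been trained to recognize dates and times, so you need to adapt it and find a way to detect times.

Several packages can help extract named entities. Stanford NER (Named Entity Recognizer) is one of the most popular named entity recognition tools and is implemented in Java. You can still use it from Python by downloading the package and calling it through NLTK, which provides an interface to Stanford NER.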

You can download Stanford Named Entity Recognizer version 3.4, where you will find stanford-ner.jar and the classifier model all.3class.distsim.crf.ser.gz.

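from nltk.tag.stanford import StanfordNERTagger  # named NERTagger in older NLTK releases

def stanfordNERExtractor(sentence):
    # Paths to the downloaded classifier model and the Stanford NER jar
    st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
                           '/usr/share/stanford-ner/stanford-ner.jar')
    # The tagger expects a list of tokens, so split on whitespace
    return st.tag(sentence.split())

stanfordNERExtractedLines = stanfordNERExtractor("New-York")
print(stanfordNERExtractedLines)  # [('New-York', 'LOCATION')]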

You can also use NLTK on its own. More details are in the official documentation, and this gist from Gavin is worth checking.


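How do we determine the destination?

Once the locations have been identified, you may still have trouble recognizing names made of several space-separated words, or telling the source from the destination.

It is best to write a regular expression pattern that identifies the source and the destination. You may run into problems with other words being captured (such as "to get"), but you already have a list of locations to validate against from st.tag ("LOCATION"), or, if you used NLTK alone, you can check whether the word is a verb ("VB") or a noun ("NN"). You can also use NLTK's UnigramTagger() and BigramTagger() to check probabilities, so that the names following "from" and "to" can be identified as locations.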

import re

text = "I want to go to New York from Atlanta, business class, on 25th July."

# A place name is one or more capitalized words joined by spaces or hyphens.
place = r'([A-Z][a-zA-Z]*(?:[\s-][A-Z][a-zA-Z]*)*)'
destination = re.findall(r'\bto\s+' + place, text)
source = re.findall(r'\bfrom\s+' + place, text)

print(source, destination)  # ['Atlanta'] ['New York']
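The validation step described above (checking regex candidates against the locations the tagger found) could look roughly like the sketch below. It is only an illustration under stated assumptions: `known_locations` is a hypothetical hand-written set standing in for the tokens labelled LOCATION, and `extract_route` is an invented helper name, not part of NLTK.

```python
import re

# Hypothetical stand-in for the tokens a NER tagger labelled LOCATION.
known_locations = {"new york", "atlanta"}

def extract_route(text, locations):
    """Return (source, destination), keeping only validated place names."""
    def first_known(keyword):
        # The lookahead captures up to three words after the keyword without
        # consuming them, so a later "to" inside that window is still tried.
        pattern = r"\b" + re.escape(keyword) + r"\s+(?=((?:[\w-]+\s*){1,3}))"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            words = match.group(1).split()
            # Try the longest candidate first so "New York" beats "New".
            for size in range(len(words), 0, -1):
                candidate = " ".join(words[:size])
                if candidate.lower() in locations:
                    return candidate
        return None
    return first_known("from"), first_known("to")

text = "I want to go to New York from Atlanta, business class, on 25th July."
print(extract_route(text, known_locations))  # ('Atlanta', 'New York')
```

Because every candidate is checked against the set, phrases such as "to go" or "to get" are skipped even when the input is not capitalized.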

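How do we determine the time/date?

As mentioned above, this is one of the problems we may face, but we can use a regular expression, as described in this thread.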

print(re.findall(
    r"""(?ix)             # case-insensitive, verbose regex
    \b                    # match a word boundary
    (?:                   # match the following three times:
     (?:                  # either
      \d+                 # a number,
      (?:\.|st|nd|rd|th)* # followed by a dot, st, nd, rd, or th (optional)
      |                   # or a month name
      (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
     )
     [\s./-]*             # followed by a date separator or whitespace (optional)
    ){3}                  # do this three times
    \b """,
    text))

Output:

['25th July']

With a year in the text (for example "25th July 2014"), the match includes the year as well.

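Instead of a regular expression, we can also use python-dateutil or this.

In case some parts are missing, such as the year or the month, we can use the parsedatetime package to handle it.

Check this quick example (you can adapt it to different scenarios):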

>>> import parsedatetime
>>> p = parsedatetime.Calendar()
>>> print(p.parse("25th this month"))
(time.struct_time(tm_year=2014, tm_mon=11, tm_mday=10, tm_hour=1, tm_min=5, tm_sec=31, tm_wday=0, tm_yday=314, tm_isdst=0), 0)
>>> print(p.parse("25th July"))
((2015, 7, 25, 1, 5, 50, 0, 314, 0), 1)
>>> print(p.parse("25th July 2014"))
((2014, 7, 25, 1, 6, 3, 0, 314, 0), 1)
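One last thing: you can use this dataset to extract airports and verify that the locations mentioned are valid, in case you answer with availability (some locations have no airport).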

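For the class, you can validate it by looking for the words "economy class" or "business class" in the sentence (you can choose between the in operator and a regular expression).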
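A regular-expression version of that check might look like the following minimal sketch. The list of classes and the name `find_cabin_class` are assumptions made for illustration; adjust them to the wording your users actually type.

```python
import re

# Hypothetical list of cabin classes; extend it to match your domain.
CABIN_CLASS_PATTERN = re.compile(
    r"\b(economy|business|first)\s+class\b", re.IGNORECASE
)

def find_cabin_class(text):
    """Return the cabin class mentioned in the text, or None."""
    match = CABIN_CLASS_PATTERN.search(text)
    return match.group(1).lower() if match else None

text = "I want to go to New York from Atlanta, business class, on 25th July."
print(find_cabin_class(text))  # business
```

The regex tolerates different capitalization and extra spaces between the two words, which a plain `"business class" in text` test would miss.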

For more details on this topic, see: NLTK - Extracting Information from Text

User

Answer 2 · edited 2024-05-15 11:58:18

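This problem is called "named entity recognition" (or "NER" for short). Searching Google for these terms should turn up some specific pointers.

Try the demo NER system at http://nlp.stanford.edu:8080/ner/

Detecting references to dates and times is probably best handled with heuristics.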

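If the text domain you work with is specific and very limited, setting up a manually edited list of entities can be very useful, e.g. just list all airport codes and the names of all cities that have commercial airports, and try exact string matching of those names against any input text.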
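As a rough sketch of that gazetteer idea, the snippet below matches a tiny hand-written table against the input. The three entries in `AIRPORTS` and the function name `find_gazetteer_matches` are illustrative assumptions; a real list would be loaded from an airport dataset.

```python
import re

# Hypothetical, hand-edited gazetteer: airport code -> city name.
AIRPORTS = {"JFK": "New York", "ATL": "Atlanta", "SFO": "San Francisco"}

def find_gazetteer_matches(text, airports=AIRPORTS):
    """Return sorted airport codes whose code or city name appears in text."""
    found = set()
    for code, city in airports.items():
        # Codes are matched case-sensitively as whole words.
        if re.search(r"\b" + re.escape(code) + r"\b", text):
            found.add(code)
        # City names are matched as whole words, ignoring case.
        if re.search(r"\b" + re.escape(city) + r"\b", text, re.IGNORECASE):
            found.add(code)
    return sorted(found)

text = "I want to go to New York from Atlanta, business class, on 25th July."
print(find_gazetteer_matches(text))  # ['ATL', 'JFK']
```

Exact matching like this does not say which place is the source and which is the destination; it only confirms that a mentioned name is a known airport location.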
