Answers to the question "Python NLTK: interpreting a fixed sentence pattern and tokenize i"


1 answer

Anonymous, 1 day ago

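In computational linguistics this is called "<a href="http://en.wikipedia.org/wiki/Named-entity_recognition" rel="nofollow noreferrer">Named Entity Recognition</a>": the process of identifying things such as organizations, people, and locations in text. The challenge here is that the default NE chunker in NLTK is a maximum entropy chunker trained on the <a href="http://catalog.ldc.upenn.edu/LDC2005T09" rel="nofollow noreferrer">ACE corpus</a>. It has not been trained to recognize dates and times, so you need to tweak it and find a way to detect time.

Several packages can help extract named entities. Stanford-NER (Named Entity Recognizer) is one of the most popular named entity recognition tools and is implemented in Java. You can still use it from Python by downloading the package and talking to it through NLTK, which provides an interface to Stanford-NER.

You can download <a href="http://nlp.stanford.edu/software/stanford-ner-2014-06-16.zip" rel="nofollow noreferrer">Stanford Named Entity Recognizer version 3.4</a>, where you will find stanford-ner.jar and the classifier model "all.3class.distsim.crf.ser.gz".
<pre><code>from nltk.tag.stanford import NERTagger
# Note: later NLTK releases renamed NERTagger to StanfordNERTagger.

def stanfordNERExtractor(sentence):
    # Paths to the classifier model and the jar from the downloaded package
    st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
                   '/usr/share/stanford-ner/stanford-ner.jar')
    # The tagger expects a list of tokens
    return st.tag(sentence.split())

stanfordNERExtractedLines = stanfordNERExtractor("New York")
print(stanfordNERExtractedLines)
# output as given in the original answer: [('New-York', 'LOCATION')]
</code></pre>
You can also do this with NLTK itself. You can find more details in the <a href="http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford" rel="nofollow noreferrer">official document</a>, and check this gist <a href="https://gist.github.com/gavinmh/4735528/" rel="nofollow noreferrer">from Gavin</a>.
<ul> <li>How do we identify the destination? Once the locations have been recognized, you may still have trouble with names made of several space-separated words, or with telling the source from the destination.</li> </ul>
It is best to write a regular expression pattern to identify the source and the destination. The pattern may pick up other words (such as <code>"to get"</code>), but you already have the list of locations from <code>st.tag</code> ("LOCATION") to validate against, or, if you used NLTK, you can check the word's part-of-speech tag (verb "VB" versus noun "NN"). You can also explore NLTK's UnigramTagger() and BigramTagger() to pick up the names after "from" and "to" that can be identified as locations.
<blockquote> <pre><code>import re

text = "I want to go to New York from Atlanta, business class, on 25th July."

# One or more capitalized words, joined by whitespace or hyphens.
# (The original pattern used a lazy quantifier and captured only "New".)
place = r'([A-Z][a-zA-Z]*(?:[\s-][A-Z][a-zA-Z]*)*)'
destination = re.findall(r'\bto\s+' + place, text)
source = re.findall(r'\bfrom\s+' + place, text)
print(source, destination)
# Python 3 prints: ['Atlanta'] ['New York']
</code></pre> </blockquote>
<ul> <li>How do we identify the time/date?</li> </ul>
As mentioned above, this is one of the problems we can run into, but we can use a regular expression, as described in this <a href="https://stackoverflow.com/questions/3809985/how-to-find-dates-in-the-sentence-using-nlp-regex-in-python">thread</a>.
<pre><code>print(re.findall(
    r"""(?ix)             # case-insensitive, verbose regex
    \b                    # match a word boundary
    (?:                   # match the following three times:
     (?:                  # either
      \d+                 # a number,
      (?:\.|st|nd|rd|th)* # followed by a dot, st, nd, rd, or th (optional)
      |                   # or a month name
      (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
     )
     [\s./-]*             # followed by a date separator or whitespace (optional)
    ){3}                  # do this three times
    \b
    """,
    text))
</code></pre>
Output reported in the original answer (the pattern requires three components, so the text has to include the year, e.g. "25th July 2014"):
<pre><code>25th July 2014. </code></pre>
Instead of regular expressions we can also use <a href="http://labix.org/python-dateutil" rel="nofollow noreferrer">python-dateutil</a> or <a href="https://github.com/cnorthwood/ternip" rel="nofollow noreferrer">ternip</a>.

In case some parts are missing, such as the year or the month, we can handle that with the parsedatetime package.

Check this quick example (you can adapt it to different scenarios):
<pre><code>>>> import parsedatetime
>>> p = parsedatetime.Calendar()
>>> print(p.parse("25th this month"))
(time.struct_time(tm_year=2014, tm_mon=11, tm_mday=10, tm_hour=1, tm_min=5, tm_sec=31, tm_wday=0, tm_yday=314, tm_isdst=0), 0)
>>> print(p.parse("25th July"))
((2015, 7, 25, 1, 5, 50, 0, 314, 0), 1)
>>> print(p.parse("25th July 2014"))
((2014, 7, 25, 1, 6, 3, 0, 314, 0), 1)
</code></pre>
One last thing: you can use this <a href="http://ourairports.com/data/" rel="nofollow noreferrer">dataset</a> to extract airports and validate the locations that were mentioned, in case you want to answer with availability (some locations have no airport).

For the class, you can validate it by looking for the words "economy class" or "business class" in the sentence (you can choose between <code>in</code> and a regular expression).

For more details on this topic, check: <a href="http://www.nltk.org/book/ch07.html" rel="nofollow noreferrer">NLTK - Extracting Information from Text</a>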
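To tie the steps above together, here is a minimal sketch that pulls source, destination, travel class, and a raw date string out of one sentence using only the standard library. The function name `parse_request`, the constants `PLACE`, `DATE`, and `TRAVEL_CLASSES`, and the dictionary keys are hypothetical names chosen for illustration; they are not part of NLTK or of the answer above. The date pattern is also a simplified variant (day, month name, optional year) rather than the three-part regex quoted earlier, so treat it as a starting point, not a finished parser.

```python
import re

# Hypothetical helper names; nothing here comes from NLTK.
# One or more capitalized words joined by whitespace or hyphens.
PLACE = r"([A-Z][a-zA-Z]*(?:[\s-][A-Z][a-zA-Z]*)*)"

# Day number with optional ordinal suffix, a month name, and an optional year.
DATE = (
    r"\b\d{1,2}(?:st|nd|rd|th)?\s+"
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
    r"(?:\s+\d{4})?"
)

# The two class phrases mentioned in the answer.
TRAVEL_CLASSES = ("economy class", "business class")


def parse_request(text):
    """Return a dict of the fields found in text; missing fields are None."""
    source = re.search(r"\bfrom\s+" + PLACE, text)
    destination = re.search(r"\bto\s+" + PLACE, text)
    date = re.search(DATE, text)
    lowered = text.lower()
    # Plain substring check with `in`, as suggested for the class.
    travel_class = next((c for c in TRAVEL_CLASSES if c in lowered), None)
    return {
        "source": source.group(1) if source else None,
        "destination": destination.group(1) if destination else None,
        "travel_class": travel_class,
        "date": date.group(0) if date else None,
    }


if __name__ == "__main__":
    print(parse_request(
        "I want to go to New York from Atlanta, business class, on 25th July."
    ))
```

The candidate locations returned this way could then be checked against the "LOCATION" tags from the NER tagger or against the airports dataset, and the raw date string could be handed to parsedatetime to fill in a missing year.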
