Python regular expressions with NLTK: extracting websites

2 votes

1 answer

2205 views

Asked 2025-04-17 03:49

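Hi, I have never worked with regular expressions before, and I am now trying to process some raw text with Python and NLTK.

When I try to tokenize the document with the following code: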

import nltk

sentence_re = r'''(?x)  # set flag to allow verbose regexps
  ([A-Z])(\.[A-Z])+\.?  # abbreviations, e.g. U.S.A.
| \w+(-\w+)*            # words with optional internal hyphens
| \$?\d+(\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
| \#?\w+|\@?\w+         # hashtags and @ signs
| \.\.\.                # ellipsis
| [][.,;"'?()-_`]       # these are separate tokens
| (?:http://|www.)[^"\' ]+ # websites
'''
# the pattern must be defined before it is passed to the tokenizer
tokens = nltk.regexp_tokenize(corpus, sentence_re)

it fails to treat a whole website address as a single token:

print(tokens[:50])
['on', '#Seamonkey', '(', 'SM', ')', '-', 'I', 'had', 'a', 'short', 'chirp', 'exchange', 'with', '@angie1234p', 'at', 'the', '18thDec', ';', 'btw', 'SM', 'is', 'faster', 'has', 'also', 'an', 'agile', '...', '1', '/', '2', "'", '...', 'user', 'community', '-', 'http', ':', '/', '/', 'bit', '.', 'ly', '/', 'XnF5', '+', 'ICR', 'http', ':', '/', '/']

I would be very grateful if anyone could help. Many thanks!

-Florie

Tags: regex, text processing, natural language processing, nltk, tokenization

1 Answer

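With this tokenizer, you use a regular expression to describe what the tokens you want to extract from the text look like. I am a little unsure which of the many expressions above you are actually relying on, but if you simply want to pull out every run of non-whitespace characters, you can use: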

>>> corpus = "this is a sentence. and another sentence. my homepage is http://test.com"
>>> nltk.regexp_tokenize(corpus, r"\S+")
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']

For this input, that is equivalent to:

>>> corpus.split()
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']
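To check this equivalence without NLTK installed, the same pattern can be run through the standard-library re module. This is only a sketch: it assumes that regexp_tokenize, with its default settings and a pattern that has no groups, returns the same matches as re.findall, which agrees with the output shown above.

```python
import re

corpus = "this is a sentence. and another sentence. my homepage is http://test.com"

# \S+ matches each maximal run of non-whitespace characters
regex_tokens = re.findall(r"\S+", corpus)

# str.split() with no arguments splits on runs of whitespace
split_tokens = corpus.split()

print(regex_tokens == split_tokens)  # True for this corpus
print(regex_tokens[-1])              # http://test.com
```

Note that punctuation stays attached to the neighbouring word ('sentence.'), which is the price of splitting on whitespace alone.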

Another approach is to use the NLTK functions nltk.sent_tokenize() and nltk.word_tokenize():

>>> sentences = nltk.sent_tokenize(corpus)
>>> sentences
['this is a sentence.', 'and another sentence.', 'my homepage is http://test.com']
>>> for sentence in sentences:
...     print(nltk.word_tokenize(sentence))
...
['this', 'is', 'a', 'sentence', '.']
['and', 'another', 'sentence', '.']
['my', 'homepage', 'is', 'http', ':', '//test.com']

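However, as the last line of output shows, word_tokenize() splits URLs apart, so this may not be the best choice if your text contains many of them. The different tokenizers available in NLTK are described in the documentation for the nltk.tokenize package.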
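One way to keep URLs whole while still splitting off punctuation is to list a URL alternative before the general word alternative, because regex alternation takes the first alternative that matches at a given position. The sketch below is an assumption about what the asker wants, not a definitive tokenizer: the name TOKEN_RE, the helper tokenize(), the sample text, and the explicit punctuation list are all made up for illustration, and it uses the standard-library re module so that it runs without NLTK. All groups are non-capturing so that findall returns whole matches.

```python
import re

TOKEN_RE = re.compile(r'''(?x)      # verbose mode: whitespace and comments are ignored
    (?:https?://|www\.)[^"'\s]+     # URLs, listed first so they win over plain words
  | [A-Z](?:\.[A-Z])+\.?            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*                    # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?              # currency and percentages
  | [\#@]\w+                        # hashtags and @mentions
  | \.\.\.                          # ellipsis
  | [\]\[.,;:"'?()/+_`-]            # punctuation as separate tokens
''')


def tokenize(text):
    """Return every non-overlapping match of TOKEN_RE, left to right."""
    return TOKEN_RE.findall(text)


sample = "SM is agile ... user community - http://bit.ly/XnF5 #Seamonkey @angie1234p"
tokens = tokenize(sample)
print(tokens)
```

With this ordering the URL comes back as the single token 'http://bit.ly/XnF5'. A URL followed directly by a full stop or comma would still absorb that character, so real text may need an extra clean-up step.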

If you only want to extract the URLs from the text, you can use a regular expression like this:

# non-capturing group (?:...) so the full URL is returned, not just the prefix
nltk.regexp_tokenize(corpus, r'(?:http://|https://|www\.)[^"\' ]+')
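The role of the group in a URL pattern is easy to see with Python's own re.findall, which returns the contents of a capturing group instead of the whole match. The following sketch uses only the standard library; URL_RE and extract_urls() are hypothetical names chosen for this example.

```python
import re

# (?:...) groups the alternatives without capturing them
URL_RE = re.compile(r'(?:https?://|www\.)[^"\' ]+')


def extract_urls(text):
    """Return all substrings of text that look like URLs."""
    return URL_RE.findall(text)


corpus = "this is a sentence. and another sentence. my homepage is http://test.com"

urls = extract_urls(corpus)
print(urls)  # ['http://test.com']

# With a capturing group, findall returns only what the group matched
prefixes = re.findall(r'(https?://|www\.)[^"\' ]+', corpus)
print(prefixes)  # ['http://']
```

Whether a given NLTK release shows the same difference depends on how its regexp_tokenize handles groups, so writing the group as non-capturing is the cautious choice.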

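I hope this helps. If it is not the answer you were looking for, please explain more precisely what you are trying to do and what exactly the tokens you want should look like (for example, a sample input and the output you expect), so that we can help you find a suitable regular expression.

Answered 2025-04-17 by Python大师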
