What is the right tokenization algorithm? Error: TypeError: coercing to Unicode: need string or buffer, list found

0 votes
2 answers
1790 views
Asked 2025-04-16 06:20

I am working on an information retrieval task. As part of preprocessing, I want to do the following:

  1. Remove stop words
  2. Tokenize
  3. Stem (using the Porter stemmer)

Initially, I skipped the tokenization step. As a result, my vocabulary ended up with terms like these:

broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/dealers),
broker/dealers,
broker/dealers.
brokerag
brokerage,
broker-deal
broker-dealer,
broker-dealers,
broker-dealers.
brokered.
brokers,
brokers.

So now I realize how important tokenization is. Is there a standard algorithm for tokenizing English text? I was thinking of writing a tokenizer myself based on string.whitespace and the usual punctuation characters.

def Tokenize(text):
    words = text.split(['.',',', '?', '!', ':', ';', '-','_', '(', ')', '[', ']', '\'', '`', '"', '/',' ','\t','\n','\x0b','\x0c','\r'])    
    return [word.strip() for word in words if word.strip() != '']
  1. I am getting TypeError: coercing to Unicode: need string or buffer, list found!
  2. Is there any way to improve this tokenizer?
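
As a side note on the first point: str.split accepts a single separator string (or None), not a list of separators, which is why the call above raises that TypeError. Below is a minimal, list-free sketch that first replaces every delimiter with a space and then splits on whitespace; the names DELIMITERS and tokenize are illustrative, not from the original post.

import string

# the punctuation to split on, plus every character in string.whitespace
DELIMITERS = '.,?!:;-_()[]\'`"/' + string.whitespace

def tokenize(text):
    # str.split takes one separator (or none), so replace each delimiter
    # with a space and let the argument-less split() do the rest
    for ch in DELIMITERS:
        text = text.replace(ch, ' ')
    return [word for word in text.split() if word]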

2 Answers

0

As larsman mentioned, NLTK has a number of different tokenizers that accept various options. Using the defaults:

>>> import nltk
>>> words = nltk.wordpunct_tokenize('''
... broker
... broker'
... broker,
... broker.
... broker/deal
... broker/dealer'
... broker/dealer,
... broker/dealer.
... broker/dealer;
... broker/dealers),
... broker/dealers,
... broker/dealers.
... brokerag
... brokerage,
... broker-deal
... broker-dealer,
... broker-dealers,
... broker-dealers.
... brokered.
... brokers,
... brokers.
... ''')
['broker', 'broker', "'", 'broker', ',', 'broker', '.', 'broker', '/', 'deal', 'broker', '/', 'dealer', "'", 'broker', '/', 'dealer', ',', 'broker', '/', 'dealer', '.', 'broker', '/', 'dealer', ';', 'broker', '/', 'dealers', '),', 'broker', '/', 'dealers', ',', 'broker', '/', 'dealers', '.', 'brokerag', 'brokerage', ',', 'broker', '-', 'deal', 'broker', '-', 'dealer', ',', 'broker', '-', 'dealers', ',', 'broker', '-', 'dealers', '.', 'brokered', '.', 'brokers', ',', 'brokers', '.']
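
Note that wordpunct_tokenize keeps the punctuation as separate tokens, hence the filtering step below. As an aside, if you would rather not produce punctuation tokens in the first place, NLTK's RegexpTokenizer lets you tokenize with a pattern of your choosing; a minimal sketch, with \w+ chosen purely as an example:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+')
>>> tokenizer.tokenize("broker/dealers, broker-dealers.")
['broker', 'dealers', 'broker', 'dealers']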

If you want to filter out the items in that list that are nothing but punctuation, you can do something like this:

>>> filter_chars = "',.;()-/"
>>> def is_only_punctuation(s):
...     '''
...     returns True if set(s) is not a subset of set(filter_chars),
...     i.e. the token contains at least one non-punctuation character
...     '''
...     return not set(s) < set(filter_chars)
>>> filter(is_only_punctuation, words)

This returns:

['broker', 'broker', 'broker', 'broker', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'broker', 'dealers', 'brokerag', 'brokerage', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'brokered', 'brokers', 'brokers']
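
To tie this back to the rest of the pipeline in the question (stop word removal and Porter stemming), one possible continuation looks like the sketch below. It assumes NLTK's stopwords corpus has been downloaded (nltk.download('stopwords')), and the helper name preprocess is only for illustration:

import nltk
from nltk.corpus import stopwords   # requires the 'stopwords' corpus
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # tokenize, drop punctuation-only tokens, remove stop words, then stem
    tokens = nltk.wordpunct_tokenize(text)
    tokens = [t for t in tokens if any(c.isalnum() for c in t)]
    return [stemmer.stem(t) for t in tokens if t.lower() not in stop_words]

For example, preprocess('The broker-dealers, brokered.') should come out as stems such as 'broker' and 'dealer'.
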
1

There is no single perfect algorithm for tokenization, but your approach is probably good enough for information retrieval. It is simpler to implement with a regular expression:

import re

def Tokenize(text):
    words = re.split(r'[-\.,?!:;_()\[\]\'`"/\t\n\r \x0b\x0c]+', text)
    return [word.strip() for word in words if word.strip() != '']

This can be improved in various ways, for example by handling abbreviations correctly:

>>> Tokenize('U.S.')
['U', 'S']
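
One lightweight way to keep dotted abbreviations together is to match them explicitly before falling back to ordinary word characters. This is just a sketch; the pattern and the name tokenize_keep_abbrevs are only illustrative:

import re

ABBREV_OR_WORD = re.compile(r'(?:[A-Za-z]\.){2,}|\w+')

def tokenize_keep_abbrevs(text):
    # dotted abbreviations such as 'U.S.' are matched first, then plain words
    return ABBREV_OR_WORD.findall(text)

With this, tokenize_keep_abbrevs('U.S. brokers') gives ['U.S.', 'brokers'].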

Also, watch how you handle hyphens (-). For example:

>>> Tokenize('A-level')
['A', 'level']

If 'A' or 'a' is in your stop word list, this will be reduced to just level.
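
If hyphenated terms like A-level should instead survive as single tokens, one option is to match words together with their internal hyphens rather than splitting on '-'. Again, this is only a sketch, and the name tokenize_keep_hyphens is illustrative:

import re

WORD_WITH_HYPHENS = re.compile(r'\w+(?:-\w+)*')

def tokenize_keep_hyphens(text):
    # 'A-level' and 'broker-dealers' stay intact; other punctuation is dropped
    return WORD_WITH_HYPHENS.findall(text)

Here tokenize_keep_hyphens('A-level broker-dealers.') gives ['A-level', 'broker-dealers'].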

I suggest you have a look at Natural Language Processing with Python, chapter 3, and the NLTK toolkit.
