What is the right tokenization algorithm? Error: TypeError: coercing to Unicode: need string or buffer, list found
I am working on an information retrieval task. As part of preprocessing I want to:
- remove stop words
- tokenize
- stem (using the Porter stemming algorithm)
At first I skipped the tokenization step, and as a result I ended up with terms like these:
broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/dealers),
broker/dealers,
broker/dealers.
brokerag
brokerage,
broker-deal
broker-dealer,
broker-dealers,
broker-dealers.
brokered.
brokers,
brokers.
So now I realize how important tokenization is. Is there a standard algorithm for tokenizing English text? I wanted to write my own tokenizer based on string.whitespace and the commonly used punctuation marks:
def Tokenize(text):
    words = text.split(['.',',', '?', '!', ':', ';', '-','_', '(', ')', '[', ']', '\'', '`', '"', '/',' ','\t','\n','\x0b','\x0c','\r'])
    return [word.strip() for word in words if word.strip() != '']
But this raises
TypeError: coercing to Unicode: need string or buffer, list found
Is there any way to improve this tokenizer?
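The error occurs because str.split() accepts a single separator string (or None), not a list of separators; in Python 2 the unicode split method tries to coerce the list argument to a string and fails with exactly this message. As a regex-free sketch of the intended behavior (one possible workaround, not the only one), you can replace every delimiter with a space and then split on whitespace:

def Tokenize(text):
    # str.split() takes one separator string, so instead turn every
    # punctuation mark into a space, then split on runs of whitespace
    for ch in ".,?!:;-_()[]'`\"/":
        text = text.replace(ch, ' ')
    return text.split()  # no-argument split() covers ' \t\n\x0b\x0c\r'

print(Tokenize("broker/dealers), brokered."))
# ['broker', 'dealers', 'brokered']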
2 Answers
0
As larsman mentioned, NLTK has many different tokenizers which accept various options. Using the defaults:
>>> import nltk
>>> words = nltk.wordpunct_tokenize('''
... broker
... broker'
... broker,
... broker.
... broker/deal
... broker/dealer'
... broker/dealer,
... broker/dealer.
... broker/dealer;
... broker/dealers),
... broker/dealers,
... broker/dealers.
... brokerag
... brokerage,
... broker-deal
... broker-dealer,
... broker-dealers,
... broker-dealers.
... brokered.
... brokers,
... brokers.
... ''')
['broker', 'broker', "'", 'broker', ',', 'broker', '.', 'broker', '/', 'deal', 'broker', '/', 'dealer', "'", 'broker', '/', 'dealer', ',', 'broker', '/', 'dealer', '.', 'broker', '/', 'dealer', ';', 'broker', '/', 'dealers', '),', 'broker', '/', 'dealers', ',', 'broker', '/', 'dealers', '.', 'brokerag', 'brokerage', ',', 'broker', '-', 'deal', 'broker', '-', 'dealer', ',', 'broker', '-', 'dealers', ',', 'broker', '-', 'dealers', '.', 'brokered', '.', 'brokers', ',', 'brokers', '.']
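As an aside, wordpunct_tokenize is only one of NLTK's tokenizers. nltk.word_tokenize follows Penn Treebank conventions and makes different choices around punctuation (it needs the punkt sentence-splitting model downloaded once); its output on this input will differ from the list above, so it is worth comparing a few tokenizers on a sample of your own corpus before picking one:

import nltk
# nltk.download('punkt')  # one-time download required by word_tokenize

print(nltk.word_tokenize("broker/dealers), brokered."))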
If you want to filter the punctuation-only items out of the wordpunct_tokenize output above, you can do this:
>>> filter_chars = "',.;()-/"
>>> def is_only_punctuation(s):
...     '''
...     returns True unless set(s) is a strict subset of set(filter_chars),
...     i.e. keeps every token that is not made up purely of punctuation
...     '''
...     return not set(s) < set(filter_chars)
>>> filter(is_only_punctuation, words)
which returns:
['broker', 'broker', 'broker', 'broker', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'broker', 'dealers', 'brokerag', 'brokerage', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'brokered', 'brokers', 'brokers']
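Note that this snippet is Python 2, as is the original TypeError; in Python 3, filter() returns a lazy iterator rather than a list. A comprehension with the same strict-subset test is an equivalent sketch that works in both versions:

words_only = [w for w in words if not set(w) < set(filter_chars)]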
1
There is no perfect tokenization algorithm, but your approach is probably good enough for information retrieval. It is simpler to implement with a regular expression:
import re

def Tokenize(text):
    words = re.split(r'[-\.,?!:;_()\[\]\'`"/\t\n\r \x0b\x0c]+', text)
    return [word.strip() for word in words if word.strip() != '']
This can be improved in many ways, for example by handling abbreviations correctly:
>>> Tokenize('U.S.')
['U', 'S']
Also pay attention to how you handle the hyphen (-). For example:
>>> Tokenize('A-level')
['A', 'level']
If 'A' or 'a' is in your stop word list, this token will be reduced to just level.
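One way to address both problems is to describe what a token looks like instead of listing the separators. The pattern below is purely an illustrative sketch, not a standard: it keeps dotted abbreviations and hyphenated words intact, and it will still miss plenty of edge cases:

import re

TOKEN_RE = re.compile(r"""
      (?:[A-Za-z]\.){2,}     # dotted abbreviations such as U.S. or e.g.
    | \w+(?:[-']\w+)*        # plain words, optionally hyphenated or with apostrophes
""", re.VERBOSE)

def Tokenize(text):
    return TOKEN_RE.findall(text)

print(Tokenize("The U.S. broker-dealers passed their A-level exams."))
# ['The', 'U.S.', 'broker-dealers', 'passed', 'their', 'A-level', 'exams']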
I suggest you have a look at Natural Language Processing with Python, chapter 3, and the NLTK toolkit.
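Putting the three preprocessing steps from the question together, here is a minimal end-to-end sketch using NLTK. It assumes the stopwords corpus has been downloaded once via nltk.download('stopwords'); the choice of wordpunct_tokenize and of isalpha() filtering are illustrative, not prescriptive:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download('stopwords')  # one-time download of the stop word list

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = nltk.wordpunct_tokenize(text.lower())
    # drop punctuation-only and stop word tokens, then apply the Porter stemmer
    return [stemmer.stem(t) for t in tokens
            if t.isalpha() and t not in stop_words]

print(preprocess("The brokers were brokering brokerage deals."))
# ['broker', 'broker', 'brokerag', 'deal']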