Python NLTK: bigrams, trigrams, fourgrams

3 answers

User

Answer 1 · edited 2024-06-07 19:14:56

This is how I do it:

def words_to_ngrams(words, n, sep=" "):
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]

It takes a list of words as input and returns a list of n-grams (for the given n), each joined by sep (a space by default).
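As a quick illustration of the function above, here is a small usage sketch. The sample sentence is an assumption made up for this example; the function body is repeated only so the snippet runs on its own.

```python
def words_to_ngrams(words, n, sep=" "):
    # Slide a window of size n over the word list and join each window with sep.
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Sample input, assumed for illustration only.
words = "hi how are you".split()

bigrams = words_to_ngrams(words, 2)
trigrams = words_to_ngrams(words, 3)
fourgrams = words_to_ngrams(words, 4)

print(bigrams)    # ['hi how', 'how are', 'are you']
print(trigrams)   # ['hi how are', 'how are you']
print(fourgrams)  # ['hi how are you']

# When n exceeds the number of words, the range is empty and so is the result.
too_long = words_to_ngrams(words, 5)
print(too_long)   # []
```

Passing a different sep, for example "_", changes only how the words inside each n-gram are joined.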

User
Answer 2 · edited 2024-06-07 19:14:56

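If you apply a bit of set theory (assuming I have interpreted your question correctly), you will see that the trigrams you want are simply the slices [2:5], [4:7], [6:9], etc. of the token list.
You can generate them like this: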
>>> new_trigrams = []
>>> c = 2
>>> while c < len(token) - 2:
...     new_trigrams.append((token[c], token[c+1], token[c+2]))
...     c += 2
...
>>> print(new_trigrams)
[('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
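The loop above can also be written as a small function that runs on its own. This is only a sketch: the name strided_ngrams and its parameters are hypothetical, and the token list is an assumption, chosen so that it is consistent with the output shown in this answer.

```python
def strided_ngrams(tokens, n=3, start=2, step=2):
    """Collect n-grams as tuples, starting at index `start` and advancing by `step`."""
    # The last valid start index is len(tokens) - n, hence the exclusive bound below.
    return [tuple(tokens[i:i + n]) for i in range(start, len(tokens) - n + 1, step)]

# Assumed token list (not given in the answer), consistent with its output.
token = ['hi', 'how', 'are', 'you', '?', 'i', 'am', 'fine', 'and', 'you']

new_trigrams = strided_ngrams(token)
print(new_trigrams)
# [('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
```

With start=0 and step=1 the same function returns every consecutive n-gram, which is the ordinary sliding-window case.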

User
Answer 3 · edited 2024-06-07 19:14:56

Try everygrams:

from nltk import everygrams
list(everygrams('hello', 1, 5))

[out]:

[('h',),
 ('e',),
 ('l',),
 ('l',),
 ('o',),
 ('h', 'e'),
 ('e', 'l'),
 ('l', 'l'),
 ('l', 'o'),
 ('h', 'e', 'l'),
 ('e', 'l', 'l'),
 ('l', 'l', 'o'),
 ('h', 'e', 'l', 'l'),
 ('e', 'l', 'l', 'o'),
 ('h', 'e', 'l', 'l', 'o')]

With word tokens:

from nltk import everygrams

list(everygrams('hello word is a fun program'.split(), 1, 5))

[out]:

[('hello',),
 ('word',),
 ('is',),
 ('a',),
 ('fun',),
 ('program',),
 ('hello', 'word'),
 ('word', 'is'),
 ('is', 'a'),
 ('a', 'fun'),
 ('fun', 'program'),
 ('hello', 'word', 'is'),
 ('word', 'is', 'a'),
 ('is', 'a', 'fun'),
 ('a', 'fun', 'program'),
 ('hello', 'word', 'is', 'a'),
 ('word', 'is', 'a', 'fun'),
 ('is', 'a', 'fun', 'program'),
 ('hello', 'word', 'is', 'a', 'fun'),
 ('word', 'is', 'a', 'fun', 'program')]
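To make clear what the listings above contain, here is a pure-Python sketch that produces the same tuples, grouped by length from shortest to longest. It is not NLTK's implementation: the name everygrams_sketch is hypothetical, and the order in which nltk's everygrams yields its results may differ between NLTK versions.

```python
def everygrams_sketch(sequence, min_len=1, max_len=None):
    """Return all n-grams of `sequence` with min_len <= n <= max_len, shortest first."""
    items = list(sequence)
    if max_len is None:
        max_len = len(items)
    result = []
    for n in range(min_len, max_len + 1):
        # For each length n, slide a window of that size over the sequence.
        for i in range(len(items) - n + 1):
            result.append(tuple(items[i:i + n]))
    return result

char_grams = everygrams_sketch('hello', 1, 5)
word_grams = everygrams_sketch('hello word is a fun program'.split(), 1, 5)

print(len(char_grams))  # 15
print(len(word_grams))  # 20
print(word_grams[-1])   # ('word', 'is', 'a', 'fun', 'program')
```

Setting min_len and max_len to the same value gives only n-grams of that size, for example 2 for bigrams or 4 for fourgrams.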
