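I'm using the movie_reviews corpus from the nltk library, which contains a large number of documents. My task is to measure predictive performance on these reviews both with and without preprocessing the data. The problem is that the lists documents and documents2 contain the same documents, and I need to shuffle them so that both lists end up in the same order. I can't shuffle them separately, because every shuffle gives a different result. That's why I need to shuffle both at once in the same order: I compare them at the end, and the comparison depends on the order. I'm using Python 2.7.
Example (in reality the strings are tokenized, but that is not relevant here):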
documents = [(['plot : two teen couples go to a church party , '], 'neg'),
(['drink and then drive . '], 'pos'),
(['they get into an accident . '], 'neg'),
(['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
(['drink then drive . '], 'pos'),
(['they get accident . '], 'neg'),
(['one guys dies'], 'neg')]
After shuffling both lists, I need to get this result:
documents = [(['one of the guys dies'], 'neg'),
(['they get into an accident . '], 'neg'),
(['drink and then drive . '], 'pos'),
(['plot : two teen couples go to a church party , '], 'neg')]
documents2 = [(['one guys dies'], 'neg'),
(['they get accident . '], 'neg'),
(['drink then drive . '], 'pos'),
(['plot two teen couples church party'], 'neg')]
I have this code:
import random
import nltk
from nltk.corpus import movie_reviews, stopwords

def cleanDoc(doc):
    # Lowercase, drop stopwords and tokens of 2 characters or fewer, then stem
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    clean = [token.lower() for token in doc if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

# Raw documents: (list of words, category)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# Preprocessed documents, built in the same order as documents
documents2 = [(list(cleanDoc(movie_reviews.words(fileid))), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
# random.shuffle(...)  # here I need to shuffle documents and documents2 in the same order, somehow
You can do it like this:
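A minimal sketch of one common way to do this, assuming both lists have the same length: zip the lists into pairs, shuffle the pairs once, then unzip them. The names a, b and pairs are placeholders, not taken from the question.

```python
import random

# Placeholder lists (hypothetical data, not the question's documents)
a = ['a', 'b', 'c']
b = [1, 2, 3]

# Pair up the elements so that each pair moves as one unit
pairs = list(zip(a, b))

# Shuffle the pairs once, in place
random.shuffle(pairs)

# Unzip back into two sequences; zip() yields tuples, so convert them to lists
a, b = map(list, zip(*pairs))
```

Because only the pairs are shuffled, a[i] and b[i] still belong together at every index after the shuffle.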
Of course, this is an example with simpler lists, but adapting it to your case works the same way.
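Adapted to the sample data from the question, the same idea might look like the sketch below. It assumes documents and documents2 were built in the same order, as they are in the question; the name combined is hypothetical.

```python
import random

documents = [(['plot : two teen couples go to a church party , '], 'neg'),
             (['drink and then drive . '], 'pos'),
             (['they get into an accident . '], 'neg'),
             (['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
              (['drink then drive . '], 'pos'),
              (['they get accident . '], 'neg'),
              (['one guys dies'], 'neg')]

# Tie each raw document to its preprocessed counterpart
combined = list(zip(documents, documents2))

# One shuffle reorders both versions identically
random.shuffle(combined)

# Split the pairs back into two lists
documents, documents2 = (list(group) for group in zip(*combined))
```

After this, documents[i] and documents2[i] still refer to the same review, so the two runs can be compared index by index.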
Hope this helps. Good luck.
Shuffle any number of lists simultaneously.
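A possible sketch of such a helper, assuming all inputs are non-empty and of equal length. The function name shuffle_list comes from this answer, but the body below is an assumption about how it could be written, and a, b, c are placeholder data.

```python
from random import shuffle

def shuffle_list(*lists):
    """Shuffle any number of equal-length sequences with one shared order."""
    rows = list(zip(*lists))   # one row per position across all inputs
    shuffle(rows)              # a single shuffle, applied to whole rows
    return list(zip(*rows))    # one tuple per input sequence

a = [0, 1, 2, 3, 4]
b = [5, 6, 7, 8, 9]
c = [10, 11, 12, 13, 14]

a1, b1 = shuffle_list(a, b)
a2, b2, c2 = shuffle_list(a, b, c)
```

The inputs are left unchanged; each call returns new sequences, and the order differs from run to run.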
Note: the objects returned by shuffle_list() are tuples.
P.S. shuffle_list() can also be applied to numpy.array().
I have a simple way:
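One simple approach, sketched here as an assumption rather than as this answer's own code, is to shuffle a list of indices once and use it to reorder both lists. The names a, b and indices are placeholders.

```python
import random

# Placeholder lists of equal length (hypothetical data)
a = [0, 1, 2, 3, 4]
b = [5, 6, 7, 8, 9]

# Build the index list and shuffle it once
indices = list(range(len(a)))
random.shuffle(indices)

# Apply the same index order to both lists
a_shuffled = [a[i] for i in indices]
b_shuffled = [b[i] for i in indices]
```

Keeping the indices list around also makes it possible to reorder further lists later with exactly the same permutation.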