Machine learning text classification

import re
from gensim import corpora

# alltxts is assumed to be a list of document strings defined earlier (not shown in the question).
stoplist = set('le la les de des à un une en au ne ce d l c s je tu il que qui mais quand'.split())
stoplist.add('')
# The trailing empty alternative is dropped: Python 3.7+ would otherwise split on empty matches.
splitters = r"; |, |\*|\. | |'"

def bigrams(tokens):
    # Join each pair of adjacent tokens into a single "w1_w2" feature.
    return ["{0}_{1}".format(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

liste = (re.split(splitters, doc.lower()) for doc in alltxts)  # generator: nothing is held in memory
dictionary = corpora.Dictionary(bigrams(l) for l in liste)  # bigrams
print(len(dictionary))

# NOTE: the dictionary tokens are bigrams, so single-word stop words will match few or none of them.
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq < 10]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and bigrams found in fewer than 10 documents
dictionary.compactify()  # remove gaps in the id sequence left by the removed tokens
print(len(dictionary))

# CAUTION: a generator that has already been consumed does not rewind, so re-create it to be safe.
liste = (re.split(splitters, doc.lower()) for doc in alltxts)
alltxtsBig = (bigrams(l) for l in liste)
corpusBig = [dictionary.doc2bow(text) for text in alltxtsBig]

1 answer

User

#1 · Posted 2024-06-01 00:30:16

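Welcome to Stack Overflow. First, are you sure your performance is bad? You don't even say how well you are doing, but if (as you seem to say) you are trying to determine the author from a single sentence, I would not expect that to be possible at all. Author identification is usually done on longer texts.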
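One way to judge whether a score is actually bad is to compare it with a trivial baseline. The snippet below is only a sketch of that idea: it computes the accuracy of always predicting the most frequent author. The function name and the toy labels are assumptions made for illustration and do not come from the question.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Hypothetical author labels, for illustration only.
train = ["A", "A", "A", "B"]
test = ["A", "B", "A", "A", "B"]
print(majority_baseline_accuracy(train, test))
```

A classifier that does not clearly beat this number has learned little, whatever its raw accuracy looks like.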

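I'm afraid your code is both incomplete (where is gensim defined? what do all those library functions do?) and too long to follow easily. But are you using the presence of every (non-stopword) bigram in your texts as a feature for your classifier? That is a lot of features, and they are all of the same kind (bigrams). You could try adding some different kinds of features to the mix, and/or using the bigram features more selectively, to avoid overtraining. You should read up a bit and find out what kinds of things are likely to work; author identification is not a new task.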
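As a rough sketch of what "more selective bigrams plus other kinds of features" could look like, the standard-library example below keeps only bigrams within a document-frequency band and appends a few simple style measurements. Everything here is an assumption for illustration: the function names, the thresholds `min_df` and `max_df_ratio`, and the three style features (mean word length, mean sentence length, type-token ratio) are not taken from the question or recommended by the answer.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters, including accented ones.
    return re.findall(r"[^\W\d_]+", text.lower())

def bigrams(tokens):
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def select_bigrams(docs, min_df=2, max_df_ratio=0.8):
    """Keep bigrams found in at least min_df documents and in at most
    max_df_ratio of all documents (both thresholds are arbitrary here)."""
    df = Counter()
    for doc in docs:
        df.update(set(bigrams(tokenize(doc))))  # count each bigram once per document
    limit = max_df_ratio * len(docs)
    return sorted(b for b, n in df.items() if min_df <= n <= limit)

def style_features(text):
    """A few non-bigram features; hypothetical choices, not a vetted set."""
    tokens = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(tokens), 1)
    return [
        sum(len(t) for t in tokens) / n,        # mean word length
        len(tokens) / max(len(sentences), 1),   # mean sentence length in words
        len(set(tokens)) / n,                   # type-token ratio
    ]

def featurize(text, vocab):
    # Binary presence of each selected bigram, followed by the style features.
    present = set(bigrams(tokenize(text)))
    return [1.0 if b in present else 0.0 for b in vocab] + style_features(text)

# Toy documents, for illustration only.
docs = ["le chat dort. le chat mange.", "le chat court vite.", "un chien dort."]
vocab = select_bigrams(docs)
print(vocab)
print(featurize(docs[0], vocab))
```

With real data the thresholds would need tuning on held-out texts, and the resulting vectors can be passed to whatever classifier is already in use.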

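Your question is a bit too broad to be answered effectively, since there are so many possible answers. But hang in there, and ask some more specific questions once you have worked on this some more. Good luck!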
