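I am trying to use this tutorial to classify text for a new project: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html The goal is to automatically pick a suitable category in a category tree for a given document.
However, I get an error when I try to run the classification in a loop. This is most of my classifier class: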
import psycopg2
import psycopg2.extras
from sklearn.datasets import fetch_20newsgroups,load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
from random import randint
import settings
class Classifier(object):
    # Set Naive Bayes classifier
    nb_classifier = Pipeline([('vect', CountVectorizer()),
                              ('tfidf', TfidfTransformer()),
                              ('clf', MultinomialNB())])
    random = randint(2, 9)

    def __new__(cls):
        inst = object.__new__(cls)
        return inst

    # Constructor
    def __init__(self):
        # Start connection with database
        db_settings = "host='{}' dbname='{}' user='{}' password='{}'".format(
            settings.DB_HOST, settings.DB_TARGET, settings.DB_USER, settings.DB_PASS)
        self.conn = psycopg2.connect(db_settings)
        self.cursor = self.conn.cursor()
        print(randint(2, 9))

    # Get categorized data from db for training purposes
    def getCategories(self, parent):
        if parent == 0:
            self.cursor.execute("""SELECT "categories"."id", concat_ws(', ', products.name::text) AS ab FROM "products"
                INNER JOIN "product_categories" ON "products"."id" = "product_categories"."product_id"
                INNER JOIN "categories" ON "product_categories"."category_id" = "categories"."id"
                WHERE "parent" = 0""")
        else:
            self.cursor.execute("""SELECT "categories"."id", concat_ws(', ', products.name::text) AS ab FROM "products"
                INNER JOIN "product_categories" ON "products"."id" = "product_categories"."product_id"
                INNER JOIN "categories" ON "product_categories"."category_id" = "categories"."id"
                WHERE "categories"."id" IN (SELECT * FROM (
                    WITH RECURSIVE relevant_taxonomy AS (
                        SELECT id
                        FROM categories
                        WHERE id = %s
                        UNION ALL
                        SELECT categories.id
                        FROM categories
                        INNER JOIN relevant_taxonomy ON relevant_taxonomy.id = categories."parent"
                    )
                    SELECT id FROM relevant_taxonomy
                ) AS subtree WHERE subtree.id != %s);""", (parent, parent,))
        return self.cursor.fetchall()

    # Train a classifier with train-data
    def train_classifier(self, classifier, train_data):
        # Train the given classifier with the given data
        trained_classifier = classifier.fit(train_data.data, train_data.target)
        return trained_classifier
This is the classification file, where I use the Classifier class. classify.py:
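[code block not preserved in the source]
I start by executing this function once while looping over the new documents, as you can see at the bottom (for doc in ['Loopschoen']:). As you can see, I start with the categories that have no parent (0), which are the root nodes. The function returns the category the document should be placed in. That is only the top level of the category tree, though, so I try to run the function again with this new value (so that it looks for the children of the chosen category) by returning a call to the function itself. Finally, when it cannot find any child categories, it returns the last category.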
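Since the classify.py code is not shown here, the following is only a rough sketch of the descent described above, not the original function. The tree in CHILDREN, the ids 1, 7 and 8, and the pick_largest stand-in (which replaces "train on the candidates' products and predict") are all hypothetical; only the root id 0, the first result 2, and the final id 20 come from the question.

```python
from typing import Callable, Dict, List

# Hypothetical category tree: parent id -> child ids (0 is the root level).
CHILDREN: Dict[int, List[int]] = {0: [1, 2], 2: [7, 8], 8: [20]}


def descend(doc: str, pick: Callable[[str, List[int]], int], parent: int = 0) -> int:
    """Walk down the tree until the chosen category has no children."""
    while True:
        candidates = CHILDREN.get(parent, [])
        if not candidates:
            return parent  # no children left: this is the final category
        # Choose one child, then repeat the search below it.
        parent = pick(doc, candidates)


def pick_largest(doc: str, candidates: List[int]) -> int:
    # Stand-in for the real classifier; it simply picks the highest id.
    return max(candidates)


final_category = descend("Loopschoen", pick_largest)
print(final_category)
```

Written as a loop rather than as recursion, each level gets a fresh list of candidates, which makes it easier to see what state is carried from one level to the next.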
But the second pass of the loop fails every time with this error.
Error:
ValueError: Found array with dim 46197. Expected 92394
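The loop is the only problem. On the first pass I receive a category number, number 2. If I then run the script again with classify(2, doc), I receive the next category, and after running it 4 or 5 times I get the message put document "Loopschoen" in term_taxonomy id 20. So if I rerun the script by hand and change the value each time, it works. But the loop fails.
Does anyone know why this loop fails?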
Edit 1:
We know it fails in the Classifier class, at this line:
trained_classifier = classifier.fit(train_data.data, train_data.target)
But we do not know why.
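Note that 92394 is exactly 2 × 46197, which suggests that one of the two arrays passed to fit was filled twice. A small check before the fit call can make such a mismatch visible earlier. The helper below is a hypothetical sketch (check_same_length is not part of scikit-learn or of the original code):

```python
def check_same_length(data, target):
    """Raise a descriptive error if samples and labels differ in count."""
    if len(data) != len(target):
        raise ValueError(
            "data has {} items but target has {}".format(len(data), len(target))
        )


message = ""
check_same_length(["a", "b"], [0, 1])  # equal lengths: passes silently
try:
    # Simulates a data list that was filled twice while target was not.
    check_same_length(["a", "b", "a", "b"], [0, 1])
except ValueError as err:
    message = str(err)
    print(message)
```

Calling it just before classifier.fit(...) inside train_classifier would report which of the two sequences has grown.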
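Found the problem: I had to reset the arrays inside the loop. Each pass was only adding values to train_data.data, so its length no longer matched train_data.target:
train_data = Traindata()
train_data.data = []
It expected train_data.target to have a length of 80841 because train_data.data contained 80841 items (including the items from the previous pass).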
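One way this kind of accumulation can happen is a list defined at class level, which every instance shares. The definition of Traindata is not shown, so the class below is a hypothetical stand-in, and load / load_fixed are illustrative names, not the original code:

```python
class Traindata(object):
    """Hypothetical container; the original definition is not shown."""
    data = []      # class-level list: shared by every instance
    target = None


def load(rows):
    train_data = Traindata()
    for _, text in rows:
        train_data.data.append(text)  # appends to the shared class-level list
    train_data.target = [label for label, _ in rows]
    return train_data


def load_fixed(rows):
    train_data = Traindata()
    train_data.data = []  # fresh list for this instance, as in the fix above
    for _, text in rows:
        train_data.data.append(text)
    train_data.target = [label for label, _ in rows]
    return train_data


rows = [(1, "shoe"), (2, "shirt")]
load(rows)                       # first pass
second = load(rows)              # second pass sees the first pass's items too
buggy_lengths = (len(second.data), len(second.target))

fixed = load_fixed(rows)
fixed_lengths = (len(fixed.data), len(fixed.target))
print(buggy_lengths, fixed_lengths)
```

In the buggy version the data list doubles on the second pass while target does not, which is the same shape of mismatch as in the error; assigning a new list per pass keeps the two in step.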