Splitting data into training and test sets

Published 2024-06-02 05:03:20


I want to reproduce this tutorial with a different dataset to classify two groups: https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/. However hard I try, I cannot get it to work. I am new to programming, so I would appreciate any help or hints.

My dataset is small (240 files per group), with files named 01-0240.

I think the problem is around these lines of code:

    if is_trian and filename.startswith('cv9'):
        continue
    if not is_trian and not filename.startswith('cv9'):
        continue
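Since the files here are numbered rather than named `cv000`-`cv999` like the tutorial's movie reviews, the `startswith('cv9')` test has to be replaced. A minimal sketch of one possible replacement, assuming filenames are plain numbers with an extension (e.g. `001.txt` ... `240.txt`) and reserving roughly the last 10% for the test set (the cutoff 216 is an assumption, not from the tutorial):

```python
def is_test_file(filename):
    """Hypothetical split rule: numbered files above 216 go to the test set."""
    stem = filename.split('.')[0]  # drop the extension
    try:
        number = int(stem)
    except ValueError:
        return False  # not a numbered file; treat as training data
    return number > 216  # files 217-240 form the test set

# Inside the loop, this would replace the startswith('cv9') checks:
# if is_trian and is_test_file(filename):
#     continue
# if not is_trian and not is_test_file(filename):
#     continue
```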

and these:

            trainy = [0 for _ in range(900)] + [1 for _ in range(900)]
            save_dataset([trainX,trainy], 'train.pkl')

            testY = [0 for _ in range(100)] + [1 for _ in range(100)]
            save_dataset([testX,testY], 'test.pkl')

So far I have run into two errors:

Input arrays should have the same number of samples as target arrays. Found 483 input samples and 200 target samples.

Unable to open file (unable to open file: name = 'model.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
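For the second error: in the tutorial, a training script calls `model.save('model.h5')`, and a separate evaluation script loads that file, so `errno = 2` usually means the training script never ran (or ran in a different working directory). A hedged sketch of a guard that makes this failure mode explicit (`load_model_checked` is a hypothetical helper, not part of the tutorial):

```python
import os

def load_model_checked(path='model.h5'):
    """Fail with a clear message when the saved model file is missing."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            "%s not found; run the training script that calls "
            "model.save('%s') first" % (path, path))
    from keras.models import load_model  # deferred import: needs Keras installed
    return load_model(path)
```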

I would really appreciate any prompt help.

Thanks in advance.

// Part of the code //

# load all docs in a directory
def process_docs(directory, is_trian):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any transcript in the test set

I want to add an argument below to indicate whether training or test files are being processed, as mentioned in the tutorial. Or if there is a better way, please share it.

        if is_trian and filename.startswith('----'):
            continue
        if not is_trian and not filename.startswith('----'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc)
        # add to list
        documents.append(tokens)
    return documents

# save a dataset to file
def save_dataset(dataset, filename):
    dump(dataset, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load all training transcripts
healthy_docs = process_docs('PathToData/healthy', True)
sick_docs = process_docs('PathToData/sick', True)
trainX = healthy_docs + sick_docs
trainy = [0 for _ in range(len(healthy_docs))] + [1 for _ in range(len(sick_docs))]
save_dataset([trainX,trainy], 'train.pkl')

# load all test transcripts
healthy_docs = process_docs('PathToData/healthy', False)
sick_docs = process_docs('PathToData/sick', False)
testX = healthy_docs + sick_docs
testY = [0 for _ in range(len(healthy_docs))] + [1 for _ in range(len(sick_docs))]

save_dataset([testX,testY], 'test.pkl')

2 Answers

You should post more of your code, but it sounds like your problem is how you manage the data. Suppose you have 240 files in a folder named "healthy" and 240 files in a folder named "sick". You then need to label all the healthy people 0 and all the sick people 1. Try the following:

from glob import glob 
from sklearn.model_selection import train_test_split

#get the filenames for healthy people 
xhealthy = [ fname for fname in glob( 'pathToData/healthy/*' )]

#give healthy people label of 0
yhealthy = [ 0 for i in range( len( xhealthy ))]

#get the filenames of sick people
xsick    = [ fname for fname in glob( 'pathToData/sick/*')]

#give sick people label of 1
ysick    = [ 1 for i in range( len( xsick ))]

#combine the data 
xdata = xhealthy + xsick 
ydata = yhealthy + ysick 

#create the training and test set 
X_train, X_test, y_train, y_test = train_test_split(xdata, ydata, test_size=0.1)

Then train your model with X_train, y_train and test it with X_test, y_test. Keep in mind that your X data is just filenames, which still need to be processed. The more code you post, the more people can help you solve the problem.

I was able to solve the problem by manually separating the dataset into training and test sets and then labeling each set separately. My current dataset is too small, so once I am able to, I will keep looking for a better solution for large datasets. Closing the question.
