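I have to process two text files that contain hotel reviews. Next to each review is a value indicating whether the review is truthful or deceptive. To load the training and test sets, I have this code: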
import csv

x_train = list()
y_train = list()
with open('TRAINING_ALL.txt', encoding='utf-8') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for row in reader:
        x_train.append(row[0])
        y_train.append(int(row[1]))

x_test = list()
y_test = list()
with open('TEST_ALL.txt', encoding='utf-8') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for row in reader:
        x_test.append(row[0])
        y_test.append(int(row[1]))
Then I want to classify the reviews with a neural network. However, I get stuck at the data-loading step, where I get:
Loading data...
480 train sequences
320 test sequences
Pad sequences (samples x time)
So far so good: it reports the correct number of sequences. But then the error is:
ValueError: invalid literal for int() with base 10: "ould take a quick dip in the pool. I toured the hotel as my niece is planning her wedding and just so happens to live close to the hotel. The ' Chagall Ballroom ', was elegant enough for such an occa
What is the correct input for this code?
Note that the code originally worked as follows (taking the dataset from imdb):
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
Maybe x_train and x_test are not in the correct format?
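For comparison, imdb.load_data returns each review as a list of integer word indices, whereas the lists built from the tab-separated files hold raw strings, which is consistent with the int() error above. The following is only a minimal sketch of turning text into padded integer sequences, assuming a plain whitespace tokenizer; `build_vocab`, `encode` and `pad_pre` are hypothetical helper names, the sample reviews are made up, and left-padding with 0 is chosen to mirror the defaults of pad_sequences.

```python
from collections import Counter

def build_vocab(texts, max_features):
    """Map the most frequent words to ids starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_features - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def encode(texts, vocab):
    """Replace each known word by its id; words outside the vocabulary are dropped."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]

def pad_pre(seqs, maxlen):
    """Keep the last maxlen ids of each sequence and left-pad shorter ones with 0."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in seqs]

# Made-up stand-ins for the contents of x_train.
reviews = ["the pool was great", "the room was dirty"]
vocab = build_vocab(reviews, max_features=10)
encoded = encode(reviews, vocab)
padded = pad_pre(encoded, maxlen=6)
print(padded)
```

In a real script the vocabulary would be built from x_train only and then reused to encode x_test, so that both sets share the same word ids.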
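When you load data from a CSV file, the first row may contain the column names. You can easily check this by looking at the first element of x_train and x_test. If that is the case, you can skip the first line when reading the file.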
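Skipping a header row can be sketched as follows. This is a minimal example, assuming the same tab-separated layout as in the question; `load_reviews` and its `has_header` flag are hypothetical names introduced here, and the in-memory sample data is made up.

```python
import csv
import io

def load_reviews(infile, has_header=False):
    """Read (text, label) rows from a tab-separated file object."""
    reader = csv.reader(infile, delimiter='\t')
    if has_header:
        next(reader, None)  # discard the column-name row, if the file has one
    texts, labels = [], []
    for row in reader:
        texts.append(row[0])
        labels.append(int(row[1]))
    return texts, labels

# In-memory stand-in for a file such as TRAINING_ALL.txt with a header row.
sample = io.StringIO("review\tlabel\nGreat stay\t1\nNever again\t0\n")
x_demo, y_demo = load_reviews(sample, has_header=True)
print(x_demo[0], y_demo)
```

With a real file, the same function could be called inside `with open('TRAINING_ALL.txt', encoding='utf-8') as infile:`.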